ABSTRACT


This thesis applies Latent Dirichlet Allocation (LDA) to the problem of topic and topic change 
in conversational threads using e-mail. We demonstrate that LDA can be used to successfully 
classify raw e-mail messages with threads to which they belong, and compare the results with 
those for processed threads, where quoted and reply text have been removed. Raw thread clas- 
sification performs better, but processed threads show promise. We then present two new, un- 
supervised techniques for identifying topic change in e-mail. The first is a keyword clustering 
approach using LDA and DBSCAN to identify clusters of topics, and transition points between 
them. The second is a sliding window technique which assesses the current topic for every 
window, identifying transition points. The keyword clustering performs better than the sliding 


window approach. Both can be used as a baseline for future work. 
CHAPTER 1:
Introduction 





1.1 Introduction 

This thesis applies Latent Dirichlet Allocation (LDA) to the problem of topic and topic change 
in conversational threads using e-mail. Using LDA, probabilistic models are built for the topics 
within the corpus, and these topics are used both to cluster e-mails with their original threads, 


as well as to study where these topics change within threads. 


We demonstrate that LDA can be used to successfully classify raw e-mail messages with threads 
to which they belong, and compare the results with those for processed threads, where quoted 
and reply text have been removed. 


We then present two new, unsupervised techniques for identifying topic change in e-mail. The 
first takes keywords identified by the LDA algorithm and clusters them to identify topics in 
threaded conversations. Topic changes are then by definition the transitions between clusters. 
The second technique uses a sliding window over each thread. For each window, the current 
topic is calculated using the LDA word-topic weights, and a note is made when this topic 


changes. 


1.2 Motivation 

The resolution, accuracy and availability of sociological information is increasing at a rapid 
rate. So, too, is our ability to quantify that data. Never in history have we had such fine grained, 
quantitative measures of individuals’ actions across such vast and diverse swaths of people. 
The consistency and benevolent nature of this data goes orders of magnitude beyond what was 
possible even ten years ago. To have compatible and consistent data sets across large groups of 


people has historically been messy and difficult, if not impossible. 


Today, not only has this type of data become more available, it has become so ubiquitous that it
requires us to develop new approaches to studying it. Machine learning algorithms, including
probabilistic topic models, are a promising approach.


Mining through these large data collections to detect consistent patterns and trends can give us 


empirically verifiable data about the nature of human social dynamics and interactions. Mining 


through them for anomalies and deviations from some reference point can suggest events or 


individuals worthy of further study. 


For example, it is a well known problem in digital forensics that investigators are often given 
large hard drives to analyze, with little or no indication of where to start or what the important 
documents are. A personal hard drive in 2009 can easily be half a terabyte in size. Much of 
this data can be natural language content, with conversational content interspersed throughout. 


Without statistical methods to handle all this data, scientists and investigators alike are lost. 


E-Mail, specifically, is a rich source of information about the dynamics of information flow in
social networks. It is inherently structured, with a variety of time, author, and content-related 
metadata fields. Its increase in availability on phones and online has increased its prevalence as 
an easy and accessible communications medium for the full range of online tasks, from com- 
munications to grocery lists and event planning. Storage trends in reliability and affordability 


mean people are retaining more and more of their e-mail, and for longer periods of time. 


E-Mail presents an analytical challenge different from longer documents such as articles, reports 
or books. E-Mail bodies can vary in length from a few words to multiple paragraphs. Their often 


terse nature can push the limits of most content- and topic-analysis algorithms in use today. 


Interestingly, e-mail might not be such a rich source of information for long. As technology- 
based communications diversify into more customized and better suited platforms such as in- 
stant messenger, social networking sites, and wikis, we may no longer have the luxury of a 
single, de-facto platform for the exchange of content between individuals. We should take the 
opportunity to study e-mail corpora now, while they are still in widespread use. 


Motivated by a desire to better understand characteristic patterns of human interaction, this 
thesis focuses on the question of how conversations evolve. Using e-mail for its threaded
conversational nature, topic and topic change are studied by examining patterns of word
usage in e-mails. Specifically, this work uses new and emerging techniques in data mining and 
machine learning to show that we can build accurate models of topics in e-mail threads, us- 
ing state-of-the-art probabilistic techniques. It then extends these techniques to the problem of 
identifying topic change within threads, providing a baseline for what is possible with current 


methodologies, and identifying directions for future work in this area. 


1.3 Applications


In Marti Hearst’s paper “TextTiling” (discussed further in Chapter 2), she points out that topic
change can be seen as an inverted approach to topic detection. If we know where the topic changes,
we also know the boundaries of topics [13]. Beyond a study of topic boundaries, the nature and 
frequency of topic change is interesting in its own right, giving us insight into the dynamics of 


human communications. 


Topic change is useful for understanding the ways that content or opinion change over time in 
personal relationships, public sentiment, news coverage, and the evolution of interests within 
populations. Digital forensic investigations are often interested in the point at which the na- 
ture or content of a relationship changes. The ability to detect topic change would support 
investigations of sexual predators, where a conversation often starts out platonic and then turns 
sexual [21], as well as investigations into the techniques of recruitment for criminal or religious 


activity. 


If social networking sites can identify the point at which new topics emerge in individual or 
aggregate discussions, they can make better recommendations, and even serve better advertise- 


ments. 


Companies would often like greater insight into what their employees are talking about. Are 
discussions beginning on the topic of work, and then frequently evolving towards more social 
topics? Are managers confusing or distracting their group by assigning too many projects at 
once, or changing team goals too often? Could this be correlated to successfully or unsuccess- 


fully managed projects? 


Looking at the more subtle aspects of topic change, the ability to observe changes from positive 
to negative sentiments about a topic, or discussion of one policy issue moving into another (for 
example, as Congress’ schedule changes) would be very interesting. This has huge potential 
to help policy makers (in fact, all of us) understand public sentiment. The ability to correlate 
the actions of leadership in any group, with the presence, speed and nature of its constituency’s 
reactions, or the path of a topic’s flow through different demographics, could greatly aid in 


improving feedback loops, and forming more effective policies. 


Topic change can also be used to study how our understanding or priorities within certain topics
change. Blei and Lafferty applied dynamic topic models to 120 years of the journal Science, to
see how our understanding of topics such as quantum physics and neuroscience had changed
[2]. Similarly, many government organizations such as the National Academy of Sciences, and
the National Aeronautics and Space Administration (NASA), undertake decadal surveys of re- 
search priorities. The study of topic change can help us to see not only how those priorities have 
changed between surveys, but could also be used to find differences between identified and im- 
plemented priorities, through documents associated with actual missions or studies undertaken. 
Automated techniques are useful here, both because of the quantity of data, and because auto- 
mated techniques are objective in a way that can be difficult for humans. 


More theoretical research could look for characteristic patterns in the evolution of ideas, opin- 
ions or populations, captured in text over time. Sociologically, the ability to measure patterns 
of communications across cultures, age groups, dispositions, or mental conditions could all 
provide insights into ways to improve communication or simply deepen our understanding of their
nature. How often do topics or popular opinion change within specific realms, generations, or
nations? To what degree does the average conversation change topic? Do conversations fre- 
quently change topic abruptly, or in a slow and meandering fashion? When they do or do not, 
what does it imply about the participants, their relationships, or the underlying topics them- 


selves? 


1.4 Relationship to Space Exploration 
As this is NASA-sponsored research, we touch briefly upon how this work can support ongoing 


efforts to advance exploration and settlement of the solar system. 


Conversational document clustering can be applied to transcripts of verbal communications, 
and written communications, of astronauts on board the space station. Psychological and soci- 
ological studies focused on stress and quality of life factors, collaboration dynamics, and team 
effectiveness could benefit from the ability to tie together threaded conversations. Similarly, 
insight into the dynamics of topic change, as described above, could also support our insight 
into requirements for successful team dynamics, in space or otherwise. On a more immediate 
level, studies of organization priorities and undertakings, as mentioned above, could help insti- 
tutions begin to understand where discrepancies arise between intention and implementation. 
Like many large organizations, NASA is no stranger to these questions. The larger and more 
distributed a group (and NASA has over 50,000 people across 10 centers [8]), the harder it can 
be to make implementation match intention. More people tends to mean more data, and as our 
understanding of how to measure topic change develops, we can actually examine where and 


how these shifts take place. 


1.5 Outline of this Thesis 

In Chapter 2, an overview of supporting and related work is provided. Chapter 3 goes into the 
mathematics and theory behind the techniques and concepts used in the experiments for this 
thesis, and describes the data processing done to the corpus. Chapter 4 outlines the experiments 


and their results, and Chapter 5 discusses the implications of these results and ideas for future 


work. Chapter 6 offers conclusions. 
CHAPTER 2:
Prior and Related Work 





Topic detection and the automated detection of topic change touch on a number of different


areas in natural language processing, machine learning, and visualization techniques. 


Topic detection has been around in varying forms since the earliest work in natural language 
processing. Techniques such as word-sense disambiguation, clustering, cosine similarity, and 
basic word-counting have been applied with reasonable success to the topical problems of au- 
tomatic summarization, sub-topic clustering, probabilistic topic modeling, and keyword identi- 


fication. 


Topic change as a phenomenon in its own right has been studied in relatively fewer areas, 
although some researchers, such as Hearst [13], propose the detection of topic change as a 
reformulation of the problem of topic detection. Others, such as Blei and Lafferty [2], have 
studied the evolution of topics over decadal time spans as a way to observe changing norms, 


practices, and beliefs as reflected in popular scientific literature. 


A variety of more visual techniques have also been applied to natural language corpora, in 
attempts to capture and represent intuitive notions of conversational evolution, which is another 


way to approach detection of topic change. These are described below. 


2.1 Topic Detection 

At its core, topic change is heavily related to topic detection. The state of the art in topic detection
is currently rooted in probabilistic models which hypothesize a latent topic space, manifested
in the documents of a corpus; that is, there is a specific set of topics represented by the
documents, and each word $w$ belongs to each of the topics $z$ with a given probability. That is,

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j) \tag{2.1}$$



(This equation is developed more fully in Section 3.1.4.) Initial work in probabilistic topic modeling
was called Latent Semantic Analysis (LSA), which applied Singular Value Decomposition


(SVD) to term frequency counts in order to identify documents likely to be of the same topic [6]. 
Probabilistic Latent Semantic Indexing (PLSI) [14] built on LSA by associating a latent topic 
variable with each word, and postulating that the probability of a given word in a document is 
actually given by summing over the probabilities of that word across each latent topic, times the 


probability of each topic given a document. 


By introducing the latent topic variable as a probability distribution, PLSI replaced SVD with
probabilistic mixture models. The output of PLSI is a set of mixing weights, representing the 


contributions of the various topics to the document. 


By representing topics as probability distributions, PLSI provides an intuitive notion of topics 
not as discrete objects which begin and end on a boundary, but as distributions which can, and 


do, mix in with other topics. 


However, while PLSI models words using probabilities across topics, there is no generative 
model of how these probabilities arise, that is, of the topics themselves. The Expectation- 
Maximization algorithm used in PLSI to estimate topic likelihood is, partially as a result, prone 
to overfitting, and the number of parameters can grow linearly with the size of the corpus [3, 
p.2]. 


Blei, Ng, and Jordan attempted to address these shortcomings with a technique called Latent 
Dirichlet Allocation (LDA). This technique represents documents as random probability mix- 
tures over latent topics, but each topic is itself represented as a distribution over words [3]. 
A prior, drawn from a Dirichlet distribution, is assumed on the topic mixing weights. The parameters
of this Dirichlet are hyperparameters of the model.


BuzzTrack [4] uses cosine similarity between messages as one of the features of its topic detec- 
tion and tracking system. Damashek [5] uses cosine similarity to cluster documents by remov- 
ing the average of a document set from each document to be clustered, and then clustering on the 
remaining values. He finds that good, language-independent results are obtained. However, his 
technique is sensitive to any differences in the document bodies, and these can be sentimental 


as well as topic differences. 


2.2 Topic Change 
Two papers focusing specifically on the notion of topic change are worth noting here. The first is 
Marti Hearst’s “TextTiling” paper [13], which analyzes documents on a paragraph-by-paragraph 


basis and uses “patterns of lexical co-occurrence and distribution” to detect transition from one 
topic to the next. Hearst removes stop words and applies stemming, and then evaluates two 
metrics to formalize the notion of co-occurrence. The first metric uses a normalized dot product 
(or cosine similarity) between two adjacent blocks, where blocks are approximately paragraph- 
length. The second metric is called “vocabulary introduction”, and is calculated as the ratio of
new words in a given interval, to the length of that interval. Hearst’s results are comparable 
to or better than state of the art at the time. Hearst’s work is applied to journal articles with a 


single set of authors. This differs substantially from the conversational corpus we use here. 
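
The two metrics can be illustrated with a minimal Python sketch. This is not Hearst's implementation: it omits stop-word removal and stemming, and the function names, the toy blocks, and the choice to count each new word once per interval are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(block_a, block_b):
    """Normalized dot product between the term-count vectors of two token blocks."""
    counts_a, counts_b = Counter(block_a), Counter(block_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vocabulary_introduction(seen, interval):
    """Ratio of words not seen before to the length of the interval."""
    if not interval:
        return 0.0
    new_words = [t for t in set(interval) if t not in seen]
    return len(new_words) / len(interval)


# Hypothetical blocks of already-tokenized text.
blocks = [
    ["budget", "meeting", "budget"],
    ["budget", "meeting", "agenda"],
    ["lunch", "pizza", "friday"],
]
similarities = [
    cosine_similarity(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)
]
```

In this toy example the similarity between the first two blocks is high and drops to zero between the second and third, which is the kind of dip a block-comparison approach would treat as a candidate topic boundary.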


Blei and Lafferty [2] extend Latent Dirichlet Allocation (LDA) with a temporal element in order 
to study the evolution of topics over time. Specifically, LDA makes no assumption about the 
sequential aspect of documents in a corpus; however, in this work, each year is modelled as a 
set of topics, which are themselves a function of the set of topics in the previous year. They 
apply this technique to 120 years of archives of Science. At ten year intervals, they compare 
keywords from specific (pre-defined) categories such as Neuroscience and Quantum Physics, 
and the result is a study of how these keywords change on decadal time frames. Blei and 
Lafferty apply their dynamic topic models to predict the time-evolution of topics, and their 
results show that improved predictive accuracy can be obtained with these dynamic rather than 
static topic models. 


Blei and Lafferty’s work uses set intervals on which they study changes in topic contents, 
whereas we instead attempt to automatically detect when topic change occurs. Their corpus 
contains a pre-defined list of categories, or high level topics, within which they focus their anal- 
ysis of changing keywords. Finally, they examine ten year increments of a larger corpus, while 
we examine changes on the hourly or daily time frame, as represented by mere sentences or 


paragraphs. 


In this work, we explore e-mail, which is terse and informal. It is addressed to an internal 
audience (those copied on the e-mail). In addition, because of the conversational aspect, there 
are often new elements of a thread which lexically have little or nothing in common with the 
previous sentence or e-mail, but which refer to the same topic. These factors provide a rather 


different context for detecting topic change. 


2.3 Conversation Flow 

Innovative methods for representing conversation flow are the focus of many studies on e-mail, 
as well as blog comments and revision controlled documents. Many of these studies develop 
techniques to visually highlight important elements of conversations, and use the viewer as the 
classifier. Their success or failure is judged on the assessment of the viewer. In this sense, most 
of them do not represent algorithms which can be automated, as we are exploring here, but they 


do represent a legitimate approach to representing the evolution of topics and conversation flow. 


Two particularly interesting efforts in this regard can be found in the techniques of history flow 


and theme river. 


History flow [22] visualizes changes in the life cycle of a document by representing each revi- 
sion as a vertical line, and contributions by each author as individually coloured lines between 
these revisions. The thickness and location of each author’s line corresponds to the number of 
words, and location in the document, respectively. If a new contribution is made, a new line is 


started; if one is removed, that line is terminated. 


History flow has been used to visualize edits over time of Wikipedia articles. It demonstrates 
interesting, consistent visual patterns over time corresponding to specific behaviours such as 


edit wars and vandalism. 


Theme river [12] is a visualization technique which hand selects a specific set of key words in a 
series of documents, and represents their relative frequency of occurrence as the thickness of a 
smoothed line. This line “flows” along a time line of the documents in question. The thickness 
is meant to intuitively convey those words in the set which appear frequently in any given


document. 


Theme river is visually appealing and intuitive in its representation. Although the version im- 
plemented in the paper requires much supervision and hand labeling, one could imagine modi- 
fications to such a system which would automatically determine keywords, and track their role 
as a conversation evolved. These techniques are an alternative way of exploring the notion of 


conversational evolution. 


2.4 E-Mail 


E-Mail corpora have been used extensively for studies as diverse as thread prediction, authorship
studies, role identification, spam filtering, topic detection, information synthesis, keyword
extraction, and many more. We will not review all applications of e-mail research here; however,
we touch briefly on the Enron corpus [15] to discuss why it was not used.


By far, the Enron corpus is the most widely used corpus for research in the area of e-mail [15]. 
It consists of approximately half a million e-mails from 150 users, with attachments removed. 
Although this corpus presents a valuable option for many research areas, there are two principal


reasons why it was not used here (also identified by [4]): 


• Ground Truth: working with other people’s e-mails provides a difficult reference point
for ground truth, in terms of topic identification, relevance or meaning. In this set of
experiments, using the author’s personal e-mail enabled more knowledgeable and accurate
interpretation of topics and topic change.


• Metadata: the experiments in this thesis all revolve around the e-mail thread, which
is programmatically reconstructed from individual e-mails using the In-Reply-To
header field populated by most e-mail systems. The vast majority of e-mails in the Enron
corpus do not have this header field, which would have prevented thread-based analysis.
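

Thread reconstruction from the In-Reply-To header can be sketched as follows. This is a minimal illustration using Python's standard `email` package, not the code used for the experiments; the function name `build_threads` and the sample messages are hypothetical, and a message whose parent is missing from the collection is simply treated as the start of its own thread.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Group raw messages into threads by following In-Reply-To links."""
    parent_of = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        parent = (msg.get("In-Reply-To") or "").strip()
        if msg_id:
            parent_of[msg_id] = parent or None

    def root(msg_id):
        # Walk up the reply chain until a message with no known parent is found.
        seen = set()
        while parent_of.get(msg_id) in parent_of and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent_of[msg_id]
        return msg_id

    threads = defaultdict(list)
    for msg_id in parent_of:
        threads[root(msg_id)].append(msg_id)
    return dict(threads)


# Hypothetical messages: one three-message thread and one unrelated message.
raw_messages = [
    "Message-ID: <1@example.org>\nSubject: Budget\n\nFirst message.",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n\nA reply.",
    "Message-ID: <3@example.org>\nIn-Reply-To: <2@example.org>\n\nA reply to the reply.",
    "Message-ID: <4@example.org>\nSubject: Lunch\n\nUnrelated.",
]
threads = build_threads(raw_messages)
```

Each key of `threads` is the Message-ID of a thread's first available message, and each value lists the messages belonging to that thread. Without the In-Reply-To field, every message would end up in a thread of its own.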
CHAPTER 3:
Technical Concepts and Data Processing 





In this chapter we document the techniques, concepts, and technical approaches used in the 
experiments undertaken for this thesis. As the intended audience spans the digital forensics and 
natural language processing communities, as well as graduate students in other fields, certain 


foundational terms and concepts are covered. 


3.1 Technical Concepts 


3.1.1 Natural Language Processing 

In Natural Language Processing (NLP), researchers attempt to give structure to unstructured, or 
natural, language documents, in order to build tools or algorithms which analyze patterns and 
meaning in the content. This is typically done by splitting documents into words, and analyzing 
those words either individually (the ‘bag-of-words’ approach, see below), or as n-grams, sen- 
tences, or paragraphs. The process of identifying word boundaries is called tokenization, and 


the resulting objects are formally called tokens. 


The term ‘bag-of-words’ is used to describe an approach to text processing where the words 
are treated as isolated entities, without regard to their immediate context or order. Conceptu- 
ally, it’s as though the words in a document were thrown into a bag; but more importantly, a 
bag is a technical term that, as opposed to a set, allows for duplication of tokens. For many 
applications this is a useful simplifying approach. This thesis uses the bag-of-words approach 
for topic modeling in several of the experiments. N-grams are ordered sequences of n words 
or characters, and they give additional context to words (or characters) in a document. Because 
the possible number of n-grams in a document with vocabulary of size $V$ is $V^n$, the number
of possible n-grams for a given vocabulary is (relatively) huge, while the number of actual n- 
grams in a document is low compared to this possibility space. Natural Language data sets, for 
this reason, are often referred to as ‘sparse.’ Naturally, as n increases, so does the sparsity of the 
coverage. Because of this sparsity, the probability of any specific n-gram is very low, and thus 
its presence can be a strong indicator of a feature (such as a topic or author). However, greater 
sparsity may come hand in hand with over-training, and selection of the right n is typically a 


trade off between accuracy and coverage—as is the case with most statistical techniques. 
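

The sparsity argument can be made concrete with a short Python sketch that compares the number of distinct n-grams observed in a small text with the number that are possible for its vocabulary. The helper name `ngrams` and the example text are assumptions for illustration only.

```python
def ngrams(tokens, n):
    """Return the ordered word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = (
    "government data transparency is important for government to function well"
).split()
vocabulary = set(tokens)

# For each n: (distinct n-grams observed, n-grams possible for this vocabulary).
coverage = {
    n: (len(set(ngrams(tokens, n))), len(vocabulary) ** n) for n in (1, 2, 3)
}
```

With a vocabulary of 9 words, the text covers all 9 unigrams, but only 9 of the 81 possible bigrams and 8 of the 729 possible trigrams, so the fraction of the possibility space that is observed shrinks rapidly as n grows.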


Whatever the basic unit of analysis, many NLP techniques involve creation of some kind of 
vector space model. Each word (or n-gram) in the vocabulary represents a dimension, and each
document can then be represented by a vector. The dimensions correspond to the words or 
n-grams therein, and the value of each dimension corresponds to the number of occurrences of 
that object. In many cases, the value for a given dimension will be 0 if it has not occurred in 


that particular document. 


For example, consider the sentence, “Government data transparency is important for government
to function well,” tokenized on white space boundaries.


[Figure 3.1 (image not reproduced): the example sentence is split into tokens, which are matched against an indexed corpus vocabulary. Tokens are normalized in terms of case (all converted to lowercase) and counted, becoming values; the vocabulary index becomes the dimension. These single tokens are also called uni-grams.]


Figure 3.1: Basic word tokenization and vector-space representation for unigrams. 


By converting documents to vector space representations, the tools of geometry and algebra
can be applied, and questions of difference and distance between documents become meaningful.
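

A minimal Python sketch of this conversion is given below. The vocabulary here is an assumption built from the example sentence plus two unused words, and is not the vocabulary shown in Figure 3.1; the trailing period is dropped so that whitespace tokenization yields clean tokens.

```python
from collections import Counter


def to_vector(text, vocabulary):
    """Map a text to a list of term counts, one dimension per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


sentence = "Government data transparency is important for government to function well"

# Hypothetical corpus vocabulary: the sentence's words plus two that do not occur.
vocabulary = sorted(set(sentence.lower().split()) | {"people", "vote"})
vector = to_vector(sentence, vocabulary)
```

The dimension for "government" holds the value 2 because case normalization merges both occurrences, while the dimensions for "people" and "vote" hold 0.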


3.1.2 Supervised and Unsupervised Learning 


There are two basic categories of machine learning algorithm, supervised and unsupervised. 


Supervised learning involves a problem where the number of classes or groups into which the 
input data is being sorted, is known ahead of time, or determined at some point in the analysis. 


Supervised learning asks, “to which of these n groups does this record or data point belong?” 
Supervised learning is often associated with classification problems. It is considered supervised
in the sense that the learning task has examples of correct patterns to work with, in order to build 


its model. Supervised learning is used in the e-mail classification experiment in this thesis. 


In contrast, unsupervised learning involves determining as part of the algorithm, the number 
of classes or groups of the final output, as well as which records fit into which classes. It is 
unsupervised in the sense that the algorithm has no set of known correct examples guiding the 
development of a model. Unsupervised learning is often associated with clustering problems. 
Unsupervised learning is used for the topic change experiment in this thesis. 


3.1.3 Cross Validation, Test and Training Data

In machine learning, data is needed to train a model, but once that model has been built, data is 
also needed to test the quality of that model. N-fold cross-validation is the practice of dividing a 
dataset into N equal-sized subsets, and then iteratively reserving one, training on the remaining 


N - 1 subsets, and testing on the reserved subset. A typical value for N is 10.


Our corpus contained a large number of threads, with a large number of messages. A 20/80 


split was used between our test data and training data. 
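

For reference, the N-fold procedure described above can be sketched in a few lines of Python. This illustrates cross-validation in general rather than the single 20/80 split used for our corpus; the function name, the fixed seed, and the integer stand-ins for records are assumptions.

```python
import random


def n_fold_splits(records, n_folds=10, seed=0):
    """Yield (train, test) pairs, reserving each of n_folds subsets once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
        yield train, test


# Hypothetical data set of 50 records, identified by integers.
splits = list(n_fold_splits(range(50), n_folds=10))
```

Every record appears in exactly one test subset, so each record is used for testing once and for training N - 1 times.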


3.1.4 Latent Dirichlet Allocation 
As introduced in Chapter 2, Latent Dirichlet Allocation (LDA) is a technique which models a 


natural language corpus as a probabilistic distribution over topics. 


Before defining LDA itself, recall these mathematical concepts from probability theory: 


A conjugate prior is a prior which results in a posterior probability distribution of the same 


algebraic form, or family, as the prior. 


A multinomial distribution of order k is one in which each time a measurement is made, 
exactly one of k possible outcomes occurs. Multinomial distributions are also sometimes 
referred to as categorical distributions, where there are k categories. The most popular 


form of multinomial distribution is, of course, the binomial distribution. 


Each document has a probability distribution over the topics, and each topic has
a probability distribution over the words in the corpus. These distributions are multinomial
distributions, because each time a selection is made, exactly one topic or word is chosen from
the possibility set.


The main question in probabilistic topic models, is how to model these distributions, and what 
the assumptions, or priors, will be about those distributions. The following derivation follows 


closely what is presented in [19]. 


The probability distribution over words in a given document for $T$ topics is

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j) \tag{3.1}$$

For a given topic $j$, let $\phi^{(j)} = P(w \mid z = j)$ be the distribution over words in the corpus for topic
$j$; similarly, $\theta^{(d)} = P(z)$ is the distribution over topics for a specific document, $d \in D$, where
$D$ is the total set of documents in the corpus. Then (3.1) can be written as

$$P(w_i) = \sum_{j=1}^{T} \phi^{(j)}_{w_i}\, \theta^{(d)}_{j} \tag{3.2}$$


$\phi$ and $\theta$ are referred to as the mixture weights of the words and topics, respectively. These
parameters indicate which words are important for which topics, and which topics are important 


for which documents. 
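

As a small numerical illustration of equation (3.2), the following Python sketch computes the probability of each word in one document as a topic-weighted sum. All numbers are hypothetical toy values, and the function name is an assumption.

```python
def word_probability(word_index, phi, theta_d):
    """P(w_i) for one document: sum over topics j of phi[j][w_i] * theta_d[j]."""
    return sum(phi[j][word_index] * theta_d[j] for j in range(len(theta_d)))


# Hypothetical toy model: 2 topics over a 3-word vocabulary.
phi = [
    [0.7, 0.2, 0.1],  # topic 0: distribution over words
    [0.1, 0.3, 0.6],  # topic 1: distribution over words
]
theta_d = [0.25, 0.75]  # one document's distribution over topics

probabilities = [word_probability(w, phi, theta_d) for w in range(3)]
```

Because each row of `phi` and the vector `theta_d` sum to one, the resulting word probabilities also sum to one. The document leans towards topic 1, so the word favoured by that topic receives the largest probability.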


To give a starting point for determining these mixing weights, Blei et al. [3] apply a prior to
the topics in the form of a Dirichlet distribution. The Dirichlet distribution is a conjugate prior
for the topic mixing weights $\theta$. For a model with $T$ topics, the $T$-dimensional Dirichlet is a
distribution over the set of possible probability distributions $p = (p_1, \ldots, p_T)$ for the topics,
and is given by

$$\mathrm{Dir}(\alpha_1, \ldots, \alpha_T) = \frac{\Gamma\left(\sum_{j} \alpha_j\right)}{\prod_{j} \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1} \tag{3.3}$$

The parameters $\alpha_1, \ldots, \alpha_T$ are considered hyperparameters for the topic model itself. In practice,
symmetric hyperparameters are used, such that $\alpha_1 = \alpha_2 = \ldots = \alpha_T = \alpha$.
Figure 3.2 shows a 2-dimensional representation of the probability space for 3 topics following
a Dirichlet distribution. The triangle shape in the figure is a geometric notion called a simplex.
The simplex is a coordinate system for probability distributions; each point $p$ in the simplex
is a $T$-tuple of the probabilities $p_j$ for each topic $j = 1, \ldots, T$, and $\sum_j p_j = 1$; that is, as required,
the sum over the probabilities for each of the topics is 1. One can think about the simplex as
the $(n - 1)$-dimensional region that connects the basis vectors for $n$-space. So for 3-space, we
would have the 2-dimensional triangle between (1, 0, 0), (0, 1, 0), and (0, 0, 1).


The hyperparameter $\alpha$ affects the smoothing of the topics across the simplex. Higher $\alpha$ has
a squeezing effect, focusing the distributions around the center of the simplex, and therefore
resulting in greater smoothing or similarity between the different $p_j$s. In practice, $\alpha$ is generally
set to less than 1, which pushes the modes of the Dirichlet to the corners of the simplex, leading to


greater distinction between the different topics. 
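The effect of the hyperparameter on the simplex can be illustrated with a small simulation. The following is a minimal sketch, not part of the experimental code: it draws points from a symmetric Dirichlet using the standard construction from normalized Gamma draws, and the function names and the sample counts are illustrative assumptions.

```python
import random


def sample_symmetric_dirichlet(alpha, num_topics, rng):
    """Draw one point p on the simplex from a symmetric Dirichlet(alpha)."""
    # A Dirichlet draw can be built from independent Gamma(alpha, 1) draws,
    # normalized so that the components sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [g / total for g in draws]


def mean_largest_component(alpha, num_topics=3, num_samples=2000, seed=0):
    """Average size of the largest probability in each sampled point.

    Values near 1 mean samples sit close to a corner of the simplex;
    values near 1 / num_topics mean samples sit close to the center.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(sample_symmetric_dirichlet(alpha, num_topics, rng))
    return total / num_samples


if __name__ == "__main__":
    for alpha in (0.1, 2.0, 4.0):
        print(alpha, round(mean_largest_component(alpha), 3))
```

With a small value such as 0.1, most of the probability mass of each sample falls on a single topic, while larger values pull the samples toward the center of the simplex.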


Figure 3.2: Symmetric Dirichlet distribution for three topics on a 2-dimensional simplex. Darker colours indicate higher probability. Left: $\alpha = 4$; right: $\alpha = 2$. Figure and caption from [19, p.5].


Steyvers and Griffiths [9] [10] [11] extend this model by applying a Dirichlet prior $\mathrm{Dir}(\beta)$ to the mixing weights of the words over topics, $\phi$, as well. Thus, there are now two hyperparameters to the model: $\alpha$ and $\beta$.

For the implementation, instead of estimating the latent parameters $\phi$ and $\theta$, the algorithm estimates the probabilities for each topic $j \in z$ directly, using a process called Gibbs sampling. Once the posterior for $z$ has been estimated, $\phi$ and $\theta$ can also be estimated.


In Gibbs sampling, each word or token in the corpus is given a probability of arising from a given topic, as a function of the topic assignments for all other words in the corpus. This conditional probability distribution is derived in [20] and given in [19] as follows:

$$P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \propto \frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta} \cdot \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha} \tag{3.4}$$

$C^{WT}$ and $C^{DT}$ are $W \times T$- and $D \times T$-dimensional matrices, respectively, of token counts. The first matrix contains the number of times a word $w$ is assigned to topic $j$; the second matrix contains the number of times topic $j$ is assigned to a word token in document $d$. In practice, equation (3.4) is normalized by dividing by its sum over all $T$ topics.


The Gibbs sampling process itself has two phases: an initial so-called burn-in period, during which samples must be discarded, and the post-burn-in period, during which samples approximate the true posterior distribution more accurately. The process of Gibbs sampling begins by assigning a random topic to each word token. Then, for each word token in turn, a new topic (where, recall, a topic in this case is simply another probability distribution based on the Dirichlet prior) is sampled from equation (3.4), and the count matrices are updated based on the new topic assignment. Every sample performs a topic assignment for all $N$ word tokens in the corpus.


Because the initial topic assignments are random, and because new topic assignments are samples from a probability distribution, multiple Gibbs samples from after the burn-in period must be obtained and combined, to generate a representative sample.
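The sampling loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Mallet implementation used in the experiments: documents are assumed to be lists of integer word ids, the function name `gibbs_lda` is hypothetical, and burn-in handling and the combination of multiple samples are left out for brevity.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, num_sweeps, seed=0):
    """Collapsed Gibbs sampling for LDA over docs (lists of integer word ids)."""
    rng = random.Random(seed)
    # Count matrices: word-topic (W x T) and document-topic (D x T).
    cwt = [[0] * num_topics for _ in range(vocab_size)]
    cdt = [[0] * num_topics for _ in docs]
    topic_totals = [0] * num_topics  # column sums of the word-topic matrix
    assignments = []

    # Begin by assigning a random topic to every word token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            j = rng.randrange(num_topics)
            z_doc.append(j)
            cwt[w][j] += 1
            cdt[d][j] += 1
            topic_totals[j] += 1
        assignments.append(z_doc)

    # Each sweep re-samples the topic of every word token in the corpus.
    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Remove the current token from the counts, so that the
                # counts reflect all *other* tokens only.
                cwt[w][old] -= 1
                cdt[d][old] -= 1
                topic_totals[old] -= 1
                # Unnormalized conditional probability of each topic. The
                # document-side denominator is identical for every topic,
                # so it is dropped; rng.choices normalizes the weights.
                weights = [
                    (cwt[w][j] + beta) / (topic_totals[j] + vocab_size * beta)
                    * (cdt[d][j] + alpha)
                    for j in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                cwt[w][new] += 1
                cdt[d][new] += 1
                topic_totals[new] += 1
    return assignments, cwt, cdt
```

Estimates of the word and topic mixing weights can then be read off the two count matrices by adding the corresponding hyperparameter to each count and normalizing.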


The output of the LDA algorithm using Gibbs sampling is a set of weights for each word in the corpus, for each topic. That is, for a corpus of $N$ words, each topic will contain $N$ words, with a corresponding weight indicating its likelihood of being drawn from that topic. A smoothing process means that a word never has an absolute zero probability of being drawn from any topic. These outputs can be used as inputs to other algorithms, or can be used directly to estimate the topic probabilities of new documents.


3.1.5 Distance and Similarity Measures 


Four measures of similarity or distance are used in our experiments, falling into two categories: distance metrics, and entropic measures.

To be considered a true metric, a distance measure must satisfy the properties that, for three vectors $x$, $y$, and $z$,

$$d(x, y) = 0 \iff x = y$$
$$d(x, z) \leq d(x, y) + d(y, z)$$

The latter condition is known as the triangle inequality.


The Euclidean distance is probably the most familiar notion of distance, and is the only true distance metric used. Recall that for two $n$-dimensional vectors $x$ and $y$ it is expressed as

$$d = \sqrt{(y_1 - x_1)^2 + \ldots + (y_n - x_n)^2} \tag{3.5}$$

Euclidean distance has a range of $[0, +\infty)$.


Cosine similarity is a measure of similarity between two vectors. It is calculated by taking the cosine of the angle between two feature vectors. This is a rather intuitive measure of similarity: when two vectors point in exactly the same direction, the angle between them is 0, and the cosine of the angle between them is 1; when the vectors are orthogonal, the cosine value is 0. The cosine similarity is given by the dot product of the two vectors divided by the product of their lengths. If $x$ and $y$ are the input vectors, the cosine similarity is given as

$$\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} \tag{3.6}$$

As can be seen from (3.6), the measure is inherently normalized, and its range is $[-1, 1]$. Because of this normalization, two vectors in the same direction with different lengths will have a similarity value of 1, but will have a Euclidean distance equal to the difference in their lengths, which in general can be arbitrarily large. One can see, then, why cosine similarity is a useful tool for comparing the similarity of documents, since we would indeed likely consider two documents of different lengths with the same words 'similar'.
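The contrast between the two measures can be checked numerically. The sketch below is illustrative only; the function names and the two example vectors (word counts for a short document and a document ten times longer with the same word proportions) are assumptions made for the example.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical word-count vectors: same direction, different lengths.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

if __name__ == "__main__":
    print(cosine_similarity(short_doc, long_doc))
    print(euclidean_distance(short_doc, long_doc))
```

Note that the cosine function as written is undefined for an all-zero vector; a caller would need to handle that case separately.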


For the e-mail classification experiment, classification decisions are based on comparison of the distribution of topic probabilities between a set of potential threads and a test e-mail. As a result, we also calculate the Kullback-Leibler (KL) divergence and the Shannon divergence. Both are entropic measures of difference between probability distributions.

KL-divergence between $P$ and $Q$ is an asymmetric measure of the amount of additional information, or number of bits, needed to encode distribution $P$ using a code based on distribution $Q$. It is calculated in the following way:

$$\mathrm{KL}(P, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{3.7}$$


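The Shannon divergence (also known as the Jensen-Shannon divergence) is a symmetrized version of the KL-divergence. It is the average of the KL-divergences of $P$ and of $Q$ from their mean, $\frac{1}{2}(P + Q)$:

$$\mathrm{Shannon}(P, Q) = \mathrm{Shannon}(Q, P) = \tfrac{1}{2}\mathrm{KL}\left(P, \tfrac{1}{2}(P + Q)\right) + \tfrac{1}{2}\mathrm{KL}\left(Q, \tfrac{1}{2}(P + Q)\right) \tag{3.8}$$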
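The two entropic measures can be sketched as follows. This is a minimal illustration rather than the code used in the experiments; the function names are hypothetical, and the use of base-2 logarithms (so that results are in bits) is an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(P, Q) in bits for two discrete distributions of equal length.

    Terms with P(i) = 0 contribute nothing. Q(i) must be positive wherever
    P(i) is positive, otherwise the divergence is undefined.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def shannon_divergence(p, q):
    """Symmetrized divergence: average KL of P and Q from their mean."""
    mean = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mean) + 0.5 * kl_divergence(q, mean)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, q), kl_divergence(q, p))
    print(shannon_divergence(p, q), shannon_divergence(q, p))
```

Because the mean distribution is non-zero wherever either input is non-zero, the symmetrized measure is always defined, which is convenient when comparing sparse topic distributions.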


Experimental results are calculated using all four difference metrics, and compared in detail in 
Chapter 5. 


3.1.6 DBSCAN 

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As the name suggests, it is a density-based clustering algorithm which efficiently discovers clusters of arbitrary shape. DBSCAN also allows for certain points to be deemed noise, and therefore not allocated to any cluster [7].


DBSCAN is used in the automated detection of topic change experiment, because we do not 
want to make assumptions a priori about the number of topics there are in a thread. 


To discover clusters, the algorithm looks for points which have a minimum number of neighbouring points, min_pts, within some radius $\epsilon$ (i.e., sets of points with a certain density). Any distance function can be used to compute the radius, although Euclidean distance is used here.


A key observation of the DBSCAN algorithm is that there are two kinds of points in a cluster: 
core points and boundary points. Core points will contain the requisite neighbour density, but 


boundary points will not. 


Thus DBSCAN demands that all points within a cluster are density-reachable from one another. Instead of requiring every point in the cluster to have min_pts within its $\epsilon$-radius, the requirement of density reachability requires that every point in the cluster have a neighbour within $\epsilon$ that has min_pts within its radius. That is, for every point $q$ in a cluster $C$, there must be another point $p$ within the $\epsilon$-neighbourhood of $q$ that has min_pts in its neighbourhood. The point $q$ is then said to be directly density-reachable from $p$, and points reachable through a chain of directly density-reachable points are considered simply density-reachable.

Finally, there may be some points in a cluster which are not density-reachable from one another, but which are both density-reachable from a common point. These are called density-connected.


[Figure labels: p is a border point, directly density-reachable from q; q is a core point, not directly density-reachable from p.]

Figure 3.3: Illustration of points that would be considered border points and core points in the DBSCAN algorithm. Figure from [7, p.3].


All density-reachable and density-connected points make up a cluster. More formally, let $D$ be a set of data points. A cluster, defined with respect to $\epsilon$ and min_pts, is a non-empty subset $C$ of $D$ satisfying the following conditions:

1. $\forall p, q$: if $p \in C$ and $q$ is density-reachable from $p$ with respect to $\epsilon$ and min_pts, then $q \in C$.

2. $\forall p, q \in C$: $p$ is (at least) density-connected to $q$ via $\epsilon$ and min_pts.

3. Any point $p \in D$ with $p \notin C_1 \ldots C_k$, where $C_1 \ldots C_k$ are the clusters of the data set, is considered noise.
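The conditions above translate into a short procedure. The following is a simplified sketch for illustration, not the code from Appendix A: it uses a brute-force neighbourhood search, counts a point as a member of its own neighbourhood, and all function and variable names are assumptions.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number, or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            # Not a core point. It may still be claimed later as a
            # border point of some cluster.
            labels[i] = NOISE
            continue
        # points[i] is a core point: grow a new cluster from it.
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                # points[j] is itself a core point, so everything in its
                # neighbourhood is density-reachable as well.
                queue.extend(j_neighbours)
        cluster += 1
    return labels
```

The brute-force search makes this quadratic in the number of points, which is adequate for small keyword sets but would need a spatial index for larger data.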


Figure 3.4 shows an example of both density-reachable and density-connected points. For 
topic change, we use DBSCAN to identify clusters of keywords forming a topic, and most 
importantly, where the boundaries of those clusters are. Boundaries are affected by the input 
parameters to the algorithm, and are the key element in identifying transition points, or topic 
changes. 


Appendix A contains the code for this algorithm, and current links to its availability online.

Figure 3.4: DBSCAN density-reachable and density-connected points. Figure from [7, p.3].





3.1.7 Topic Change 


In order to give some quantitative measure of the performance of the topic change algorithms, the traditional measures of precision and recall were used, calculated as functions of true positives (tp), false positives (fp), and false negatives (fn), with a very slight modification. For a given thread, we care both about the number of changes detected, and the accuracy of the locations of those changes. For a thread with $N$ actual topic changes, if $D$ changes were detected, and $L$ of those changes were in the correct location, then the precision is defined as:





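$$p = \begin{cases} \dfrac{tp}{tp + fp} = \dfrac{L}{L + (D - L)} = \dfrac{L}{D} & \text{if } D > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.9}$$

False positives are the number of topic changes detected that were not actually topic changes. Recall is defined as

$$r = \begin{cases} \dfrac{tp}{tp + fn} = \dfrac{L}{L + (N - D)} & \text{if } N > D \\ 0 & \text{if } L = 0 \\ 1 & \text{otherwise} \end{cases} \tag{3.10}$$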





The notion of false negative corresponds to the number of topic changes that were missed, that is, those that were considered negatives but were actually positives. Consider a thread that contains one topic change. If 50 topic changes were detected, including one in the correct location, then the precision is low, because there were many false positives. However, the recall would be 1, because there were no false negatives; that is, all the changes that exist were detected (and in the correct location).


Similarly, if a thread has 2 topic changes, and 2 changes were identified, but neither in the 


correct location, then both precision and recall would be 0. 




It is worth noting that true negative is not correspondingly well defined here. One could think 
of the number of true negatives as the number of words, sentences, or paragraphs during which 
a topic change does not take place, but it is somewhat different from the traditional notion of 


true negative. 


The F-score is calculated in the usual way:

$$F\text{-score} = \begin{cases} \dfrac{2pr}{p + r} & \text{if } p > 0 \text{ and } r > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.11}$$

The F-score is used as a measure of accuracy because it is the harmonic mean of precision and recall. The consequence of this algebraic form is that neither $p$ nor $r$ can be inflated at the other's expense without compromising the total F-score [16].
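The modified measures can be written out as small functions, which makes the worked examples above easy to check. This is a sketch of the definitions as we read them from the text (precision as correctly located changes over detected changes, recall penalizing only changes that were not detected at all); the function and parameter names are illustrative.

```python
def precision(num_detected, num_correct):
    """Fraction of detected topic changes that are in the correct location."""
    if num_detected > 0:
        return num_correct / num_detected
    return 0.0


def recall(num_actual, num_detected, num_correct):
    """Recall with false negatives counted as undetected changes."""
    if num_actual > num_detected:
        missed = num_actual - num_detected
        return num_correct / (num_correct + missed)
    if num_correct == 0:
        return 0.0
    return 1.0


def f_score(p, r):
    """Harmonic mean of precision and recall, or 0 if either is 0."""
    if p > 0 and r > 0:
        return 2 * p * r / (p + r)
    return 0.0


if __name__ == "__main__":
    # One actual change; 50 detected, one of them in the correct location.
    p = precision(num_detected=50, num_correct=1)
    r = recall(num_actual=1, num_detected=50, num_correct=1)
    print(p, r, f_score(p, r))
```

In the first worked example the recall of 1 is offset by the very low precision, so the resulting F-score stays close to 0.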


3.2 Data Processing 
The corpus used for this thesis was 4.9GB of the author’s personal e-mail over the course of 


approximately 3 years. 


Although these experiments use e-mail for its rich social value, there is a large amount of other content which must be accounted for when processing arbitrary e-mail data sets. This includes discussion lists, announcements, calendar invites, administrivia, company-wide broadcasts, and notifications from reminder services. Further, some services such as Gmail store chat logs as e-mails; some people might use e-mail as a way to send themselves reminders, or even as a file store. This is all noise from the perspective of our research.


To remove the vast majority of these noisy e-mails, only threads of length 5 or longer were retained. A minimum length of 5 was chosen because we are interested in threads where the topic has some time to develop, and where there is a meaningful conversational (back-and-forth) component.


After removing shorter threads, empty messages, and messages with only attachments or with 


unrecognized encodings, there were 2168 threads remaining. 


3.2.1 Experimental Overview 
Figure 3.5 shows a high-level overview of the steps taken in order to pre-process the e-mail data, and how and in what order the various techniques described herein are applied.

Figure 3.5: High-level overview of steps taken for data pre-processing, and both experiments.


3.2.2 E-Mail Processing 


Corpus messages were in Mbox format. Although Mbox format can come in several different flavours, generally speaking it is a plain text electronic mailbox format which stores all messages within a folder in a single file. Messages are separated by a blank line, and the beginning of each message is delimited by a line starting with the word From followed by a space.


Mbox files were parsed using Python's e-mail-handling modules, and converted to internal Message objects which stored the headers, message bodies, and in-reply-to header field, if it was present (this field is described further in Section 3.2.3, below). Message bodies were extracted using the Content-Type header field, which describes the MIME-type formats contained in the message. Often, a message will be multipart, containing a plain-text version of the message along with HTML or other encodings.

However, some messages are not multipart, nor do they have a plain text component. Some of these are HTML only. Others have the Content-Type 'message', in which case the body encapsulates another message, such as an error message from a server, a digest, or a forwarded message.

For our purposes, anything not plain text- or HTML-formatted was discarded. The plain text body was selected if it existed; otherwise, if an HTML version was present, it was used, and the HTML markup was removed separately. Message bodies were stored in UTF-8, replacing any unrecognized characters with the Unicode replacement character \uFFFD.
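The body-selection rule can be sketched with Python's standard `email` package. This is an illustrative reconstruction, not the thesis code: the function name `extract_body` is hypothetical, the regular expression is a deliberately crude stand-in for HTML removal, and encapsulated messages are not treated specially.

```python
import email
import re


def extract_body(msg):
    """Return the preferred text body of a message, or None if discarded."""
    plain, html = None, None
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            continue  # attachments and other formats are ignored
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            # Undecodable bytes become the replacement character U+FFFD.
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unrecognized charset name: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        if content_type == "text/plain" and plain is None:
            plain = text
        elif content_type == "text/html" and html is None:
            html = text
    if plain is not None:
        return plain
    if html is not None:
        # Crude tag stripping, standing in for proper HTML removal.
        return re.sub(r"<[^>]+>", " ", html)
    return None
```

Messages could be supplied to such a function by iterating over a `mailbox.mbox` object from the standard library, which yields one message per "From " delimited entry.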


3.2.3 Thread Extraction
A thread is a series of e-mails which are related to one another through the messages they reply 
to. A thread is a tree-type data structure because multiple e-mails may be in response to the 


same e-mail, and messages may not necessarily be in response to the latest message in a thread. 


Threads were reconstructed using the message-id and in-reply-to header fields of e-mail messages.


The Message-id field contains a globally unique identifier typically made up of a message 
hash followed by an ‘@’ symbol and the mail server domain. For example: 
3cbh0e8e0610091234qg5af£Eb09Fq2969c9ca2a051c17@mail.gmail.com. 





The in-reply-to field also contains a message ID; that ID is of the message which the current message is, not surprisingly, in reply to. This field should not be confused with the reply-to header field, which is a user-specified preferred e-mail address for message reply.

Two passes are made over the messages to reconstruct the threads. During the first pass, if a message's in-reply-to header field matches the message-id field of another message, that message is added to the thread. If a message either does not have an in-reply-to header field, or that field does not match any messages already in a thread, a new thread is created.


The second pass is a consolidation pass, since as threads are reconstructed, it may turn out that 
the initial message of a thread was in fact in reply to a message that came later in the processing 


queue. If this is the case, the two threads are consolidated. 
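The two passes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the thesis implementation: messages are represented as plain dictionaries with hypothetical `message_id` and `in_reply_to` keys, and threads as simple lists.

```python
def build_threads(messages):
    """Group messages into threads using in-reply-to links (two passes)."""
    thread_of = {}  # message id -> index of the thread holding it
    threads = []    # each thread is a list of messages

    # First pass: attach each message to the thread of the message it
    # replies to, if that message has already been seen; otherwise start
    # a new thread.
    for msg in messages:
        parent_id = msg.get("in_reply_to")
        if parent_id in thread_of:
            index = thread_of[parent_id]
        else:
            index = len(threads)
            threads.append([])
        threads[index].append(msg)
        thread_of[msg["message_id"]] = index

    # Second pass: consolidate. If the first message of a thread replies
    # to a message that now lives in another thread, merge the two.
    for index, thread in enumerate(threads):
        if not thread:
            continue
        target = thread_of.get(thread[0].get("in_reply_to"))
        if target is not None and target != index:
            threads[target].extend(thread)
            for msg in thread:
                thread_of[msg["message_id"]] = target
            threads[index] = []

    return [thread for thread in threads if thread]
```

For example, if a reply is processed before the message it answers, the first pass creates two threads and the second pass merges them into one.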


Figure 3.6: Overview of thread structure and terms, showing in-reply-to structure and longest branch.

For each thread, we then identify each possible branch through that thread. Because we are interested in studying topic change over the course of a thread, how to handle the branching structure of the thread is a consideration. Threads can be examined a) on a branch-by-branch basis (essentially treating each branch as a distinct thread); b) by ignoring the branching structure and focusing instead on the relative time ordering of the messages; or c) by selecting a representative branch, such as the longest branch, to represent the thread as a whole. For our purposes, a) was deemed too time consuming, and b) was deemed impractical due to unknowns regarding timezones of intermediate mail servers, as well as the fact that people sometimes reply to older messages in a thread on purpose. Thus, the 'representative branch' approach was used.
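Selecting the longest branch amounts to finding the longest chain of replies from the root of the thread tree to a leaf. The sketch below shows one simple way to do this; the dictionary keys and the function name are hypothetical, and ties are broken in favour of the branch encountered first.

```python
def longest_branch(thread):
    """Return message ids along the longest reply chain, root first.

    thread is a list of dicts with 'message_id' and 'in_reply_to' keys.
    """
    by_id = {msg["message_id"]: msg for msg in thread}
    best = []
    for msg in thread:
        # Walk from this message back towards the root of the thread.
        path = []
        seen = set()
        current = msg
        while current is not None and current["message_id"] not in seen:
            seen.add(current["message_id"])  # guards against malformed loops
            path.append(current["message_id"])
            current = by_id.get(current.get("in_reply_to"))
        if len(path) > len(best):
            best = path
    return list(reversed(best))
```

Walking back from every message is quadratic in the worst case, which is acceptable for threads of the sizes considered here.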


After threads were reconstructed, they were stored in an internal Thread object and saved to 
disk in JSON format. A custom JSON-encoder and decoder were used. 


3.2.4 Quoted Text 

In order to properly analyze conversations, a way is needed to distinguish between original content in e-mail bodies, and the category of text we call 'quoted text.' Quoted text falls into three general categories: reply text, which is text quoted from one or more previous e-mails, and typically appears either inline or appended to the current e-mail; signatures, which may be repeated many times in the course of a conversation thread but bear no relevance to the content (similar in some ways to stop words); and forwarded text, which may not have been present in previous e-mails, and may well be relevant to the conversation at hand, but is not original text from the author of the message.


A simple rule-based approach was applied using regular expressions to remove quoted text and forwarded messages. Each message is parsed for lines beginning with typical quoted-text symbols (one or more of the pipe ('|') and the greater-than symbol ('>')). Patterns are developed to match lines which mark the beginning of a quoted section, such as the many possible varieties of prefixes similar to "On May 1, 2009, jessy <jessy.cowansharp@gmail.com> wrote:", or lines marking the beginning of a forwarded message, e.g. "---------- Forwarded message ----------".
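A rule set of this kind might look like the following sketch. The three patterns are illustrative assumptions covering only the examples mentioned above, not the full set of patterns used; in particular, treating everything after a forwarded-message marker as forwarded text is a simplification.

```python
import re

# Lines starting with one or more '>' or '|' quote markers.
QUOTED_LINE = re.compile(r"^\s*[>|]+")
# Attribution lines such as "On May 1, 2009, Someone <a@b.c> wrote:".
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")
# Forwarded-message markers, with or without surrounding dashes.
FORWARD_HEADER = re.compile(r"^-*\s*Forwarded message\s*-*\s*$", re.IGNORECASE)


def strip_quoted_text(body):
    """Remove quoted reply lines and forwarded content from a message body."""
    kept = []
    for line in body.splitlines():
        if FORWARD_HEADER.match(line):
            break  # drop the marker and everything after it
        if REPLY_HEADER.match(line) or QUOTED_LINE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Real mail clients produce many more variants (localized attribution lines, wrapped headers, indented quotes), so a production rule set would need to be considerably broader.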


Alternatively, quoted text can be interpreted as key to giving context to a discussion, and repeat- 
ing what someone has said might legitimately give higher weighting to their words. Further, 
if someone quotes a paragraph or sentence of a correspondent, and simply says, “yes,” then 
removing that text could be more harmful than helpful in terms of understanding the evolution 


of the conversation. 


We call the versions of the e-mail threads with the quoted text removed processed threads, 
and the version with all the original content raw threads. In the e-mail thread classification 


experiment in Section 4.1, we compare results with both processed and raw threads. 


3.2.5 Stop Words 

It is typical in many NLP applications to remove what are called stop words from text being 
analyzed. Stop words have high frequency but low meaning; they are words which stitch sen- 
tences together, such as the, it and at. Unfortunately, stop words are language dependent, and 
must be manually identified. Many corpora also have custom stop words as a function of their 


topic. 


For our applications, the English-language stop list was used from the Python Natural Language Toolkit (NLTK) [1]. In addition, a set of custom stop words was identified through initial analysis and removed as well. The full list of stop words, both generic and custom, is contained in Appendix B.
3.2.6 Tokenization and Stemming 


For our purposes in these experiments we use a very simple tokenizer which creates strings from groups of alphanumeric characters. It is acceptable if certain words are tokenized incorrectly or somewhat arbitrarily (e.g., 8:00 gets tokenized into 8 and 00, or http://example.com/index gets tokenized into http, example, com and index), as long as the tokenization is consistent. However, it would be undesirable for words with contextual punctuation to be treated as different from ones without, e.g. "no" vs. no, or *menu* vs. menu, or (for me?) vs. me.


Two types of word stemming, Porter and Lancaster, were applied to message bodies. The ideal word stemmer would replace word tokens with their morphological roots. However, this is a rather difficult task in practice, and so various stemmers have been developed which take slightly different approaches to normalizing different tenses and possible conjugations of tokens.


Character n-grams have been used as an alternative to stemming in some applications; the idea 
being that a good choice of n will have a similar effect, by truncating tokens in a manner similar 


to a stemmer. Based on the results in [5], we try character 5-grams. 
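The tokenizer and the character n-gram alternative can be sketched as below. This is an illustration, not the thesis code: lowercasing, the handling of tokens shorter than n, and the use of overlapping n-grams within each token are all assumptions, and the function names are hypothetical. Under the alphanumeric rule, the path component of the example URL comes out as index.

```python
import re

# A token is any maximal run of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return [token.lower() for token in TOKEN.findall(text)]


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the given stop word collection."""
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]


def char_ngrams(token, n=5):
    """Overlapping character n-grams of a token; short tokens are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


if __name__ == "__main__":
    tokens = tokenize('Is the *menu* "ready" at 8:00 (for me?)')
    print(remove_stop_words(tokens, ["is", "the", "at", "for"]))
    print(char_ngrams("classification"))
```

Because punctuation is never part of a token, quoted, starred, and parenthesized forms of a word all map to the same token, which is the consistency property the tokenizer is meant to provide.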
CHAPTER 4:
Experiments





These experiments apply Latent Dirichlet Allocation (LDA) to the problems of e-mail-thread 
classification, and automated detection of topic change. 


Since the majority of threads are on a single high level topic, in the first experiment we explore 
how LDA performs on conversational corpora, by classifying e-mail messages with their cor- 
responding thread. Four distance metrics are used and compared as accuracy measures for this 


experiment. 


In the second experiment, we select a number of e-mail threads where the topic did not remain 
consistent, and attempt to automatically identify the topics using LDA, and the transition points 
between topics using DBSCAN. 


4.1 Message Thread Classification using LDA 

A naive guess about which thread an e-mail belongs to would select the longest thread; without 
any other information, this is the most likely category. Since this experiment has not been done 
previously, this becomes the baseline against which the e-mail thread classification experiments 
are compared. The baseline value is the percent of correct classifications we would expect a 
system using this naive decision scheme to achieve. The number of messages in the raw and 
processed thread groups are the same, and so the baselines are the same. The baselines for 
control groups of size 50, 100, 150, 500, and 1000, are given in Table 4.1. 


Experiment Baselines for Different Control Group Sizes

Group Size | Longest Branch | Total Messages | Baseline
50         | 5              | 250            | 2%
100        | 28             | 594            | 4.7%
150        | 28             | 931            | 3%
500        | 36             | 3489           | 1.03%
1000       | 54             | 7374           | 0.73%

Table 4.1: Baseline accuracy values for different control groups.
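The baseline values in Table 4.1 appear consistent with dividing the size of the longest thread by the total number of messages in the control group, which is the accuracy a classifier would achieve by always guessing the longest thread. The short sketch below recomputes them from the table under that assumption; the function name is illustrative.

```python
def baseline_accuracy(longest_length, total_messages):
    """Percent of messages classified correctly by always guessing the longest thread."""
    return 100.0 * longest_length / total_messages


# (group size, longest branch, total messages), copied from Table 4.1.
CONTROL_GROUPS = [
    (50, 5, 250),
    (100, 28, 594),
    (150, 28, 931),
    (500, 36, 3489),
    (1000, 54, 7374),
]

if __name__ == "__main__":
    for group_size, longest, total in CONTROL_GROUPS:
        print(group_size, round(baseline_accuracy(longest, total), 2))
```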


For this experiment, the open source library Machine Learning for Language Toolkit (MALLET) was used [17]. Mallet includes a parametrized interface for the creation of LDA topic models, a highly efficient implementation of Gibbs sampling, and a number of tools for exploring the algorithm's results.

Using Mallet, a classifier was built, with 20% of the messages from each thread reserved for a test set, and 80% used as training data. The resulting classifier was then applied to classify the test e-mails with their original threads. All experiments were averaged over 5 runs; this number was determined by running one experiment 5, 10, and 15 times, and comparing the variation of the results. Since the variation was within ±2 each time, an average of 5 runs was deemed to be sufficient. The LDA and Gibbs parameters for this model are outlined in Tables 4.2 and 4.3, respectively. These values were chosen based on experimental best practice determined by Steyvers and Griffiths in [19].








LDA Model Parameters

α | 50
β | 0.01

Table 4.2: LDA Parameters for LDA model of e-mail threads


Gibbs Parameters

Iterations | 40
Thinning   | 3000

Table 4.3: Gibbs Parameters for LDA model of e-mail threads























LDA Topics: Top Words

space   | house      | nasa     | yuri            | volunteer
earth   | room       | gov      | night           | people
moon    | people     | ames     | space           | events
mars    | place      | arc      | event           | volunteers
human   | craigslist | center   | nasa            | room
http    | living     | colab    | art             | setup
nuclear | mansions   | research | science         | stage
nations | home       | 605      | www             | 1
climate | rainbow    | http     | ames            | area
science | move       | 604      | worldspaceparty | table

Table 4.4: Top ten words for 5 of the 50 topics in one of the LDA models, ordered by decreasing weight from the top. Note that these words are representative, and in practice change slightly for each run due to Gibbs sampling. This example also includes stop words which were subsequently removed.


First, the topic weights or probabilities for each thread were calculated. For LDA, this is called estimating, and the result of an estimation is a vector with as many dimensions as there are topics in the model. Each dimension corresponds to the estimated probability that the words in the e-mail were drawn from that specific topic.

In order to classify e-mails, the topic probability weight vector of each test e-mail was compared with that of each thread, and the e-mail was classified as coming from the thread that was most similar.

Four metrics were used to measure similarity, in order to compare their results: cosine similarity, Euclidean distance, the entropy-based measure KL-divergence, and the symmetric version of KL-divergence, Shannon divergence.
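The classification step itself is a nearest-neighbour decision over topic probability vectors. The sketch below illustrates it with cosine similarity only; the thread names, the three-topic vectors, and the function names are invented for the example, and for the distance and divergence measures the smallest value rather than the largest would be selected.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def classify(email_topics, thread_topics):
    """Return the id of the thread whose topic vector is most similar.

    thread_topics maps a thread id to its topic probability vector.
    """
    return max(
        thread_topics,
        key=lambda thread_id: cosine_similarity(email_topics, thread_topics[thread_id]),
    )


if __name__ == "__main__":
    # Hypothetical topic vectors for two threads and one test e-mail.
    threads = {"budget": [0.8, 0.1, 0.1], "party": [0.1, 0.1, 0.8]}
    print(classify([0.7, 0.2, 0.1], threads))
```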


To begin with, we started with 100 threads, and results were obtained using Porter stemming, 
Lancaster stemming, no stemming, and character 5-grams. In Tables 4.5 and 4.6, we compare 


the results of these four tokenizing approaches using all four distance metrics. 


% Correct over 100 Processed Threads using 50 Topics

Measure   | Porter | Lancaster | No Stemming | 5-gram
Cosine    | 34     | 33        | 36          | 16
Euclidean | 29     | 28        | 26          | 11
KL        | 33     | 30        | 30          | 10
Shannon   | 37     | 35        | 38          | 14

Table 4.5: Results from LDA classification experiment for each of Cosine, Euclidean, and Shannon and KL-divergence distance measures for processed threads. Since the results are a function of sampling, this is an average over 5 runs. Numbers may not sum to 100 due to rounding.


% Correct over 100 Raw Threads using LDA Topics = 50

Measure   | Porter | Lancaster | No Stemming | 5-gram
Cosine    | 84     | 84        | 83          | 80
Euclidean | 88     | 85        | 83          | 66
KL        | 86     | 87        | 87          | 77
Shannon   | 88     | 85        | 84          | 82

Table 4.6: Results from LDA classification experiment for each of Cosine, Euclidean, and Shannon and KL-divergence distance measures for raw threads. Since the results are a function of sampling, this is an average over 5 runs. Numbers may not sum to 100 due to rounding.


As can be seen from Table 4.5, the correctly classified e-mails were in the mid-20% to high-30% range in most cases, except for the character n-grams, which performed worse at 10% to 16%. Compared to our baseline of 4.7%, these classification results do show improved performance.

The same experiment is performed with raw e-mail threads, without any quoted reply text, forwarded content, or signatures removed. Table 4.6 clearly shows that having this extra text to train on is a substantial advantage.


Based on the results of these initial experiments, we can see that Porter very slightly outperforms the other methods for the processed threads, but for the raw threads it performs about as well as no stemming at all. Cosine similarity and Shannon divergence were also slightly better performers in terms of the distance metric, but not significantly. We continue to calculate the different distance metrics throughout the other experiments.


[Figure: Varying LDA Topics for 100 Threads]
Figure 4.1: E-Mail classification performance for LDA models built with different numbers of topics, for both raw and processed threads. Each line shows a different distance metric. It can be seen that varying the LDA topics parameter improves the results.


Next, the number of topics used as input to the LDA models was varied, to see if these first results could be improved upon. Porter stemming was used exclusively for this experiment. In Figure 4.1, we can see that by varying the number of topics, the results for the processed threads (the bottom group) go as high as 45% for the Shannon divergence, with the optimal number of LDA topics at 150 or 175. The performance for fewer than 50 LDA topics is significantly worse than for the higher values. Other metrics follow a similar curve, with Cosine and Shannon competing (marginally) for first place.


For the raw threads, as seen in the top group of Figure 4.1, there is a similar sharp improvement 
in accuracy up to 75 LDA topics. The best results are obtained at a value of 150 or higher. It 
looks like the results may be starting to dip down again after 200 LDA topics, but results for the 
Euclidean metric stay roughly the same. This graph would need to be extended to higher LDA 


topic numbers to verify if the results continue to decrease. 


For processed threads, the optimal number of topics seemed to be 150 LDA topics, while for 
raw threads we selected 275. Holding the topic number constant at these values for processed 
and raw threads, respectively, the number of threads is increased to see how the model will 
perform on larger control groups. Figure 4.2 shows how LDA performs when run on 50, 100, 
150, 500, and 1000 threads. As in the previous experiments, a 20/80 split was used, building a 
model from 80% of the data, and testing it on the remaining 20%. 


The accuracy of the classification results decreases as the number of threads goes up, for both processed and raw threads. Although increasing the number of threads increases the amount of data to train on, it also increases the choices the classifier has when selecting a thread to associate a test e-mail with. This suggests that the noise in the data is increasing faster than the quality of the topic models, resulting in decreased accuracy. Given the corpus, this is not terribly surprising. We explore this further in Chapter 5.

The results for the raw threads are quite good here, even up to 1000 threads, while the results for the processed threads are fair up to about 150 threads, then dip under 25% at 500 threads, and reach between 13% and 15% accuracy for 1000 threads. The results are still better than the baseline of 0.73%.


For the 1000 thread control groups, the LDA topic variation experiment was re-run to see if the results differed for the larger number of threads. Figure 4.3 shows the results for the processed and raw threads. The processed threads are the bottom group, and the raw threads are the top group. The processed threads experience a minor improvement in classification results as the number of LDA topics is increased up to 500, which is interesting since for the 100 thread group, this number of topics resulted in decreased performance. The raw threads also demonstrate increased performance, but with a lower overall gain.

[Figure: Varying Number of Threads (LDA Topics = 150 (processed)/275 (raw))]
Figure 4.2: A comparison of e-mail classification performance for LDA models built with increasing numbers of threads. Results are shown for both processed threads (lower group) and raw threads (upper group). Each line shows a different distance metric. Performance decreases as the number of threads goes up.

[Figure: Varying LDA Topics for 1000 Threads]
Figure 4.3: Performance as the number of LDA topics is increased, for 1000 processed threads.


4.2 Automated Detection of Topic Change


While the majority of e-mail threads are on a single topic, a conversation can switch focus 
within a thread, either completely, or by moving into a different but related area. The question 
arises whether we can accurately define, detect, and measure such changes. 


Two approaches to detecting topic change were taken. Both were based on topic models built 
using Mallet’s implementation of Latent Dirichlet Allocation. All 2168 e-mails in the corpus 
were used to train a model, from which the weight of each word for each topic was determined. 


As can be seen in Figure 4.4, there is an exponential falloff in the weight or probability of each 
word for a given topic. Because of the probabilistic nature of LDA, and the smoothing applied, 
every word has at least some minimal probability of having been drawn from each topic. Thus, 
for the first approach, only the top N keywords from each topic in the topic model were extracted. 
A clustering algorithm was applied to the resulting output to identify subtopics, and transition 
points were identified corresponding to topic changes. 


For the second approach, a sliding window technique was used. Each window was classified 
as belonging to a specific topic by calculating the total weight of the words in one window for 
each topic, and selecting the topic with the maximum value. As the window moved across the 
thread, if the topic classification of the window differed from the previous one, a topic transition 
was identified. 


The keyword clustering experiment involved iterating over the words of a thread; if a keyword 
appeared, this was taken as an indicator of the topic. Simplistically, as different keywords 
appear over the thread, if they belong to a different topic, then a topic change is considered to 
have occurred. 


One of the challenges with this approach is that certain terms, like ‘http’ and ‘NASA’, are in 
the top N words for many topics. In certain cases, this may be a function of the relatively small 
sample size, but it is also the case that many individuals will have cross-cutting themes in their 
personal communications. 


In addition, certain words are genuinely content, but have a high frequency in the corpus and 
thus in multiple topics, because of natural commonalities in the topics of an individual’s e-mails. 
Such frequent keywords have the same problem as stop words: they dilute the topic assessment 
because they do not provide a clear indication of topic. 

Word Weight Falloff for each Topic (x-axis: Word Index; y-axis: Weight (unitless); marked region: Top 25 Keywords) 
Figure 4.4: This figure shows the keywords per topic sorted in order of decreasing weight (or relevance) to that 
topic. The x-axis shows the word ordering, while the y-axis shows the weight of a word for a given topic. There is 
one line per topic. The words themselves are not displayed; rather, the chart is shown to emphasize the exponential 
falloff of word-topic weights. 


A threshold, T, is chosen for the number of topics a word can be a keyword for before it becomes 
too diluted. If a word exceeds the threshold, it is discarded from the keyword list, and the next 
most frequent word in that topic (that is not also too common) replaces it. What we end up with 
is a list of key distinguishing words. Others have taken a more formal entropy-based approach 
[21], but that was not explored here. Higher values of N imply a more relaxed topic definition, 
accounting for more peripheral words in a topic; thus, we can think of N as a relaxation factor. 


If T is increased, a keyword can be present for more topics, meaning that the keywords will be less 
unique. In a way, this can be thought of as playing a similar role to the hyperparameter α in 
the Dirichlet distribution. Increasing T increases the number of keywords topics can share, or 
how much they overlap. 

Keyword Selection Parameters 
Parameter     | Effect 
N (Keywords)  | Relaxation 
T (Threshold) | Uniqueness 

Table 4.7: High level effects of varying the parameters N and T in the selection of keywords used in topic 
identification. 
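
The keyword selection described above can be sketched in a few lines of Python. This is only one plausible reading of the procedure, not the code used for the experiments: the function name select_keywords, the dictionary-based input format and the toy weights are all assumptions made for illustration. Words that would be a keyword for more than T topics are dropped, and each topic is refilled from its next most heavily weighted words.

```python
from collections import Counter


def select_keywords(topic_word_weights, n, t):
    """Pick up to n keywords per topic, skipping over-shared words.

    topic_word_weights: one dict per topic mapping word -> weight
    (a hypothetical input format, assumed for this sketch).
    A word is dropped once it would be a keyword for more than t topics.
    """
    # Rank each topic's vocabulary by decreasing weight.
    ranked = [sorted(w, key=w.get, reverse=True) for w in topic_word_weights]
    banned = set()
    while True:
        # Top-n words per topic, ignoring words already judged too common.
        keywords = [[w for w in words if w not in banned][:n] for words in ranked]
        counts = Counter(w for kws in keywords for w in kws)
        too_common = {w for w, c in counts.items() if c > t}
        if not too_common:
            return keywords
        # Discard over-shared words and refill from the next-ranked words.
        banned |= too_common


# Toy example: 'http' is a top word in all three topics.
topics = [
    {"http": 9.0, "shuttle": 5.0, "orbit": 4.0, "launch": 1.0},
    {"http": 8.0, "budget": 6.0, "grant": 3.0, "audit": 1.0},
    {"http": 7.0, "lunch": 5.0, "friday": 2.0, "pizza": 1.0},
]
print(select_keywords(topics, n=2, t=2))
```

With T = 2 the shared word 'http' is removed from every list; raising T to 3 lets it stay, which is the loss of uniqueness described in the text.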























If a low threshold is chosen, the resulting output will emphasize words that were more indicative 
of that topic. If N is also decreased, the distinguishing words are those which are increasingly 
related to that topic. In practice, the result is a squeezing effect that emphasizes groups of 
words which appear frequently in the same grouping, such as newsgroup footers or frequent 
correspondents’ signatures. 


Topic Correlations for Keywords in thread 73578616_length_12.bodies 
Figure 4.5: Topic correlations for keywords in a single e-mail thread. LDA Topics = 50; Keywords = 10; Threshold = 
5. We can see that structure emerges in the thread. Note that the x-axis shows words in the order they appeared in 
the thread, and thus also correlates with time. 

In Figure 4.5 it can be seen that this method begins to indicate some structure in the threads. 
In order to automate detection of significant changes, a clustering algorithm is applied to the 
data. The goal of the clustering is to determine where the groupings of points in Figure 4.5 are 
significant, and how to determine the boundaries of topics. 


The clustering algorithm used is called DBSCAN, as described in Section 3.1.6. DBSCAN is 
a density-based clustering algorithm, with the important characteristic that it discovers, instead 
of taking as input, the number of clusters in a given set of data points. DBSCAN has two 
input parameters, ε and min_pts. ε defines the radius around a given point that is searched 
for neighbours, while min_pts defines the minimum number of neighbours that radius must 
contain. 


For this application of the DBSCAN algorithm, we only want points to be clustered with other 
points from the same topic. Since a point can only be considered a neighbour of another point 
if it is within a distance of ε or less, to enforce clustering within topics only, the feature vectors 
passed to DBSCAN are defined such that each topic is its own dimension, and the distance 
between topics is always greater than ε. 
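
One way to realise this construction is sketched below. It is an illustrative reading rather than the implementation used in the experiments: the helper names feature_vector and dbscan are hypothetical, the DBSCAN shown is a plain textbook version, and it assumes that a point counts as its own neighbour when min_pts is checked. Each keyword occurrence becomes a vector holding its word position plus one axis per topic, and the topic axis is offset by more than ε, so two occurrences from different topics can never be neighbours.

```python
import math

NOISE = -1


def feature_vector(position, topic, n_topics, eps):
    """Word position plus one axis per topic (hypothetical layout).

    The topic axis is set to eps + 1, so two points from different
    topics are at least sqrt(2) * (eps + 1) apart, which exceeds eps.
    """
    vec = [0.0] * (n_topics + 1)
    vec[0] = float(position)
    vec[1 + topic] = eps + 1.0
    return vec


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one cluster label per point (NOISE = -1)."""
    def neighbours(i):
        # Includes point i itself (an assumption of this sketch).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:
                queue.extend(more)  # core point: keep growing the cluster
    return labels


# (word position, LDA topic) for each keyword occurrence in a toy thread.
observations = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1),
                (10, 1), (11, 1), (12, 1), (13, 1), (30, 0)]
points = [feature_vector(pos, topic, n_topics=2, eps=3) for pos, topic in observations]
print(dbscan(points, eps=3, min_pts=3))
```

In the toy thread the keyword at position 4 is adjacent to the first group in the text but belongs to another topic, so it is not absorbed into that cluster and is left as noise.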


Larger ε means fewer, but larger, clusters. Similarly, as min_pts is increased to values closer to 
ε, the algorithm will find more densely connected regions. In this experiment, because we are 
only clustering within topics, min_pts represents the number of keywords in a given window 
that should be part of the same LDA topic, before we consider it representative of a topic of the 
thread. In other words, min_pts is the minimum number of points needed to form a coherent 
topic within the thread. ε is a measure of how tightly bound the topics are. If the ratio of 
min_pts to ε stayed the same, but both values grew larger, it would be as though the focus of 
the clustering had blurred, or the edges of each group were less well-defined. 


We want to choose a min_pts/ε ratio that will be sensitive enough to detect changes in topic, 
but forgiving enough to account for the uncertainty introduced by the probabilistic nature of 
word-topic associations in LDA. 


Finally, in order to mark the transitions between the identified clusters (topics), the mid-point 
between the end of one cluster and the beginning of another was selected. Topic changes at the 
first word of a thread were discarded. 
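
The mid-point rule can be illustrated with a short sketch. The function name transition_points and its inputs are hypothetical, and the sketch assumes that clusters do not overlap in word position; the rule about discarding changes at the first word is not modelled here.

```python
def transition_points(labels, positions):
    """Mid-points between consecutive clusters, ordered by word position.

    labels[i] is the cluster id of the keyword at positions[i];
    -1 marks noise points, which are ignored.
    """
    spans = {}
    for label, pos in zip(labels, positions):
        if label == -1:
            continue
        start, end = spans.get(label, (pos, pos))
        spans[label] = (min(start, pos), max(end, pos))
    ordered = sorted(spans.values())
    # Place one transition halfway between the end of each cluster
    # and the start of the next one.
    return [(prev_end + next_start) / 2
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


labels = [0, 0, 0, 0, -1, 1, 1, 1, 1, -1]
positions = [0, 1, 2, 3, 4, 10, 11, 12, 13, 30]
print(transition_points(labels, positions))
```

Here the first cluster ends at word 3 and the second starts at word 10, so the single transition is placed at 6.5.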


Twelve threads were identified that contained one or more clear topic changes. These threads 
typically either had a main topic, with a detour in the middle from which it turned back on track, 
or exhibited a complete topic change from which they did not return. The former threads contained 
two changes, one to change topic onto the detour topic, and the second to change topics back. 
These detour topics are typically shorter than the parent topic. The latter threads exhibited only 
a single change. The second topics were often items that the original topic reminded the author 
to bring up, while others were unrelated to any contextual information. 


Between the keyword selection and the DBSCAN parameters, four parameter values must be selected 
for each run of the experiment. In Tables 4.8 to 4.13, the results are shown for 6 runs of the 
experiment over 12 threads, using the word-topic weights from LDA with the number of topics 
set to 50. The parameter inputs for keyword selection and the DBSCAN algorithm are included 
in the table header as a 4-tuple representing (N, T, ε, min_pts). The number and location of 
topic changes identified were tracked, and compared to the number of actual changes in the 
thread (determined by hand labelling), and their correct locations. Precision and Recall were 
calculated for all threads. The location accuracy was determined to within ± one sentence. 
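
As a worked example of how the three scores relate to the table columns, the sketch below assumes the usual definitions: precision is the number of correctly located changes divided by the number found, recall is the number correctly located divided by the number of actual changes, and F is their harmonic mean. The function name precision_recall_f is hypothetical, and scoring empty cases as zero is an assumption of this sketch.

```python
def precision_recall_f(n_changes, n_found, n_correct):
    """Precision, recall and F-score for one thread.

    n_changes: hand-labelled topic changes (# Changes)
    n_found:   changes reported by the detector (# Found)
    n_correct: reported changes in the right location (Cor. Loc)
    """
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_changes if n_changes else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Thread 78378912_length_15 in Table 4.8: 1 actual change, 4 found, 1 correct.
p, r, f = precision_recall_f(n_changes=1, n_found=4, n_correct=1)
print(round(p, 2), round(r, 2), round(f, 2))
```

For that row the sketch gives a precision of 0.25, a recall of 1 and an F-score of 0.4, matching the tabulated values.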


Figure 4.6 shows the results of clustering and transition point identification for a single thread. 
The top plot shows the LDA topics plotted for the selected keywords and threshold, and the 
bottom shows the clusters after scanning. The vertical lines are automatically determined transition 
points. The x-axis shows the keywords in the same order they appeared in the original thread, 
and thus also represents the time dimension. Note that in this (and subsequent) images, only the 
top N keywords are shown. Points which were not allocated to any cluster are considered noise, 
and they are shown in a light gray colour on the bottom plot. 


The second approach taken was the sliding window-based technique. Recall that each word has 
a certain weight associated with each topic. In general, words have a high weight in association 
with only a few topics. Thus, for each window of size N words, the corresponding word 
weights for each topic were summed, and the topic with the highest associated value was selected. 
A transition was identified as occurring at the middle of the window (position N/2). As the 
window was moved across the text of the thread, the topic could be seen to change at specific 
points, as seen in Figure 5.3. 
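
A minimal sketch of this window classification follows. It is not the code used for the experiments: the function names, the nested-dictionary format for word-topic weights, and the choice to advance the window one word at a time are all assumptions made for illustration.

```python
def window_topics(words, word_topic_weights, n_topics, size):
    """Assign each window of `size` consecutive words to its heaviest topic.

    word_topic_weights maps word -> {topic index: weight}; words without
    an entry contribute nothing (a hypothetical input format).
    """
    topics = []
    for start in range(len(words) - size + 1):
        totals = [0.0] * n_topics
        for word in words[start:start + size]:
            for topic, weight in word_topic_weights.get(word, {}).items():
                totals[topic] += weight
        # Ties go to the lowest-numbered topic.
        topics.append(max(range(n_topics), key=totals.__getitem__))
    return topics


def window_transitions(topics, size):
    """Word positions (window mid-points) where the winning topic changes."""
    return [start + size // 2
            for start in range(1, len(topics))
            if topics[start] != topics[start - 1]]


weights = {"orbit": {0: 5.0}, "launch": {0: 4.0},
           "budget": {1: 6.0}, "grant": {1: 3.0}}
words = ["orbit", "launch", "orbit", "launch",
         "budget", "grant", "budget", "grant"]
topics = window_topics(words, weights, n_topics=2, size=2)
print(topics, window_transitions(topics, size=2))
```

In the toy thread the winning topic flips once, and the reported transition falls on word index 4, the first 'budget'.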


The challenge with the window classification scheme was to select a window size large enough 
to smooth over elements like signatures, but small enough to capture genuine topic changes. 
Trial and error resulted in the selection of 20, 40 and 100 for the window sizes. The results for 
the sliding window experiment applied to the 12 threads are shown in Tables 4.14 to 4.16. 

Figure 4.6: An example of the clustering results for a single thread. Threshold T=25, N=10, min_pts = 4 and ε = 6 
(LDA Topics = 50). Vertical lines show identified topic transitions. Personally identifying terms are grayed out for 
privacy purposes. 

DBSCAN - (N=25, T=5, E=6, MP=4) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 1       | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 1       | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 4       | 1        | 0.25 | 1    | 0.4 
85425331_length_6  | 1820     | 6    | 2         | 1       | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 1       | 1        | 1    | 1    | 1 
83872656_length_15 | 4148     | 15   | 3         | 1       | 0        | 0    | 0    | 0 
59502032_length_13 | 2714     | 13   | 2         | 1       | 1        | 1    | 0.5  | 0.67 
75215072_length_7  | 3868     | 7    | 1         | 2       | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | 3       | 0        | 0    | 0    | 0 
83864104_length_18 | 8743     | 18   | 1         | 5       | 1        | 0.2  | 1    | 0.33 
73555624_length_7  | 6063     | 7    | 1         | 2       | 0        | 0    | 0    | 0 
53638713_length_6  | 1306     | 6    | 1         | 1       | 1        | 1    | 1    | 1 

Table 4.8: DBSCAN results for (N=25, T=5, E=6, MP=4). Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 





DBSCAN - (N=25, T=5, E=10, MP=6) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 0       | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 0       | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 1       | 0        | 0    | 0    | 0 
85425331_length_6  | 1820     | 6    | 2         | 0       | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 1       | 1        | 1    | 1    | 1 
83872656_length_15 | 4148     | 15   | 3         | 1       | 0        | 0    | 0    | 0 
59502032_length_13 | 2714     | 13   | 2         | 0       | 0        | 0    | 0    | 0 
75215072_length_7  | 3868     | 7    | 1         | 1       | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | 1       | 0        | 0    | 0    | 0 
83864104_length_18 | 8743     | 18   | 1         | 1       | 0        | 0    | 0    | 0 
73555624_length_7  | 6063     | 7    | 1         | 2       | 0        | 0    | 0    | 0 
53638713_length_6  | 1306     | 6    | 1         | 1       | 1        | 1    | 1    | 1 

Table 4.9: DBSCAN results for (N=25, T=5, E=10, MP=6). Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 

DBSCAN - (N=10, T=5, E=6, MP=4) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 0       | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 0       | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 3       | 0        | 0    | 0    | 0 
85425331_length_6  | 1820     | 6    | 2         | 0       | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 1       | 0        | 0    | 0    | 0 
83872656_length_15 | 4148     | 15   | 3         | 3       | 2        | 0.67 | 0.67 | 0.67 
59502032_length_13 | 2714     | 13   | 2         | 1       | 1        | 1    | 0.5  | 0.67 
75215072_length_7  | 3868     | 7    | 1         | 0       | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | 1       | 0        | 0    | 0    | 0 
83864104_length_18 | 8743     | 18   | 1         | 4       | 1        | 0.25 | 1    | 0.4 
73555624_length_7  | 6063     | 7    | 1         | 4       | 1        | 0.25 | 1    | 0.4 
53638713_length_6  | 1306     | 6    | 1         | 1       | 1        | 1    | 1    | 1 

Table 4.10: DBSCAN results for (N=10, T=5, E=6, MP=4). Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 





DBSCAN - (N=10, T=5, E=10, MP=6) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found     | Cor. Loc    | P           | R           | F 
75273856_length_8  | 2097     | 8    | 1         | 0           | 0           | 0           | 0           | 0 
53773016_length_7  | 908      | 7    | 1         | 0           | 0           | 0           | 0           | 0 
78378912_length_15 | 4530     | 14   | 1         | 1           | 0           | 0           | 0           | 0 
85425331_length_6  | 1820     | 6    | 2         | 0           | 0           | 0           | 0           | 0 
83944656_length_9  | 3450     | 9    | 1         | 1           | 0           | 0           | 0           | 0 
83872656_length_15 | 4148     | 15   | 3         | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] 
59502032_length_13 | 2714     | 13   | 2         | 1           | 0           | 0           | 0           | 0 
75215072_length_7  | 3868     | 7    | 1         | 0           | 0           | 0           | 0           | 0 
65048659_length_5  | 14436    | 5    | 1         | 1           | 0           | 0           | 0           | 0 
83864104_length_18 | 8743     | 18   | 1         | 1           | 0           | 0           | 0           | 0 
73555624_length_7  | 6063     | 7    | 1         | [illegible] | 0           | 0           | 0           | 0 
53638713_length_6  | 1306     | 6    | 1         | 1           | 1           | 1           | 1           | 1 

Table 4.11: DBSCAN results for (N=10, T=5, E=10, MP=6). Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 

DBSCAN - (N=25, T=51, E=15, MP=10) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found | Cor. Loc    | P           | R           | F 
75273856_length_8  | 2097     | 8    | 1         | 0       | 0           | 0           | 0           | 0 
53773016_length_7  | 908      | 7    | 1         | 0       | 0           | 0           | 0           | 0 
78378912_length_15 | 4530     | 14   | 1         | 1       | 0           | 0           | 0           | 0 
85425331_length_6  | 1820     | 6    | 2         | 0       | 0           | 0           | 0           | 0 
83944656_length_9  | 3450     | 9    | 1         | 1       | 0           | 0           | 0           | 0 
83872656_length_15 | 4148     | 15   | 3         | 1       | 0           | 0           | 0           | 0 
59502032_length_13 | 2714     | 13   | 2         | 0       | 0           | 0           | 0           | 0 
75215072_length_7  | 3868     | 7    | 1         | 1       | 0           | 0           | 0           | 0 
65048659_length_5  | 14436    | 5    | 1         | 2       | 0           | 0           | 0           | 0 
83864104_length_18 | 8743     | 18   | 1         | 1       | 0           | 0           | 0           | 0 
73555624_length_7  | 6063     | 7    | 1         | 3       | 0           | 0           | 0           | 0 
53638713_length_6  | 1306     | 6    | 1         | 1       | [illegible] | [illegible] | [illegible] | [illegible] 

Table 4.12: DBSCAN results for (N=25, T=51, E=15, MP=10). Results show the Thread ID, thread length in 
characters (len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), 
the number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F 
give the precision, recall and F-score, as defined in Section 3.1.7. 








DBSCAN - (N=25, T=51, E=5, MP=3) 

Thread ID          | Len (ch) | Msgs | # Changes | # Found | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 0       | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 1       | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 7       | 1        | 0.14 | 1    | 0.25 
85425331_length_6  | 1820     | 6    | 2         | 1       | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 3       | 0        | 0    | 0    | 0 
83872656_length_15 | 4148     | 15   | 3         | 7       | 3        | 0.43 | 1    | 0.6 
59502032_length_13 | 2714     | 13   | 2         | 1       | 0        | 0    | 0    | 0 
75215072_length_7  | 3868     | 7    | 1         | 5       | 1        | 0.2  | 1    | 0.33 
65048659_length_5  | 14436    | 5    | 1         | 7       | 1        | 0.14 | 1    | 0.25 
83864104_length_18 | 8743     | 18   | 1         | 6       | 1        | 0.17 | 1    | 0.29 
73555624_length_7  | 6063     | 7    | 1         | 7       | 1        | 0.14 | 1    | 0.25 
53638713_length_6  | 1306     | 6    | 1         | 1       | 1        | 1    | 1    | 1 

Table 4.13: DBSCAN results for (N=25, T=51, E=5, MP=3). Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 

Window Size = 20 

Thread ID          | Len (ch) | Msgs | # Changes | # Found     | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 4           | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 2           | 1        | 0.5  | 1    | 0.67 
78378912_length_15 | 4530     | 14   | 1         | 23          | 0        | 0    | 0    | 0 
85425331_length_6  | 1820     | 6    | 2         | 14          | 1        | 0.07 | 1    | 0.13 
83944656_length_9  | 3450     | 9    | 1         | 10          | 1        | 0.1  | 1    | 0.18 
83872656_length_15 | 4148     | 15   | 3         | 8           | 1        | 0.13 | 1    | 0.22 
59502032_length_13 | 2714     | 13   | 2         | 11          | 2        | 0.18 | 1    | 0.31 
75215072_length_7  | 3868     | 7    | 1         | [illegible] | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | 76          | 1        | 0.01 | 1    | 0.03 
83864104_length_18 | 8743     | 18   | 1         | 48          | 1        | 0.02 | 1    | 0.04 
73555624_length_7  | 6063     | 7    | 1         | 24          | 1        | 0.04 | 1    | 0.08 
53638713_length_6  | 1306     | 6    | 1         | 1           | 1        | 1    | 1    | 1 

Table 4.14: Sliding window results for window size = 20. Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 















































Window Size = 40 

Thread ID          | Len (ch) | Msgs | # Changes | # Found     | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 2           | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 0           | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 8           | 0        | 0    | 0    | 0 
85425331_length_6  | 1820     | 6    | 2         | [illegible] | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 7           | 1        | 0.14 | 1    | 0.25 
83872656_length_15 | 4148     | 15   | 3         | 4           | 1        | 0.25 | 1    | 0.4 
59502032_length_13 | 2714     | 13   | 2         | 6           | 1        | 0.17 | 1    | 0.29 
75215072_length_7  | 3868     | 7    | 1         | 4           | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | 54          | 1        | 0.02 | 1    | 0.04 
83864104_length_18 | 8743     | 18   | 1         | 48          | 1        | 0.02 | 1    | 0.04 
73555624_length_7  | 6063     | 7    | 1         | 7           | 0        | 0    | 0    | 0 
53638713_length_6  | 1306     | 6    | 1         | 0           | 0        | 0    | 0    | 0 

Table 4.15: Sliding window results for window size = 40. Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 

Window Size = 100 

Thread ID          | Len (ch) | Msgs | # Changes | # Found     | Cor. Loc | P    | R    | F 
75273856_length_8  | 2097     | 8    | 1         | 0           | 0        | 0    | 0    | 0 
53773016_length_7  | 908      | 7    | 1         | 0           | 0        | 0    | 0    | 0 
78378912_length_15 | 4530     | 14   | 1         | 2           | 0        | 0    | 0    | 0 
85425331_length_6  | 1820     | 6    | 2         | 0           | 0        | 0    | 0    | 0 
83944656_length_9  | 3450     | 9    | 1         | 1           | 0        | 0    | 0    | 0 
83872656_length_15 | 4148     | 15   | 3         | 0           | 0        | 0    | 0    | 0 
59502032_length_13 | 2714     | 13   | 2         | 0           | 0        | 0    | 0    | 0 
75215072_length_7  | 3868     | 7    | 1         | 4           | 0        | 0    | 0    | 0 
65048659_length_5  | 14436    | 5    | 1         | [illegible] | 0        | 0    | 0    | 0 
83864104_length_18 | 8743     | 18   | 1         | 28          | 1        | 0.04 | 1    | 0.07 
73555624_length_7  | 6063     | 7    | 1         | 1           | 1        | 1    | 1    | 1 
53638713_length_6  | 1306     | 6    | 1         | 0           | 0        | 0    | 0    | 0 

Table 4.16: Sliding window results for window size = 100. Results show the Thread ID, thread length in characters 
(len (ch)), number of messages in thread branch, the number of actual topic changes in the thread (# Changes), the 
number of topic changes found (# Found), and the number found in the correct location (Cor. Loc). P, R, and F give 
the precision, recall and F-score, as defined in Section 3.1.7. 










CHAPTER 5: 
Analysis and Future Work 





In this chapter we discuss the results for both the thread classification and the topic change detection 
experiments. The raw versus processed versions of the data set are considered, and the relative 
performance of the four distance metrics is compared. Characteristics and challenges of the 
data set are identified, and the definition and implementations of topic and topic change are 
explored. 


The conversational aspect of e-mail threads gives rise to high variance in vocabulary usage 
over the thread space, with a large number of contextual terms and references to concepts, 
experiences or discussions external to the thread itself. There are fewer nouns and other anchoring 
terms to train on, and more implicit assumptions about common ground and shared knowledge. 
In addition, e-mail communications are often short and informal, reducing meaningful context 
even further. 


All these factors make e-mail conversations a rather challenging data set to analyze, and this is 
borne out in the results from Chapter 4 and discussed further below. At the same time, these are 
new areas of research, and it is our hope that baselining performance with these techniques will 
help to identify directions where future research can lead to improvements. 


5.1 E-Mail Thread Classification 


For the thread classification experiments, overall we saw decreasing performance for larger 
numbers of threads. Performance was substantially increased for classification of raw threads 
versus processed threads, due to the additional context. 


The best results were around 35% for the processed threads, and just over 90% for the raw 
threads. Porter stemming, Lancaster stemming, and no stemming performed roughly the same, 
with character 5-grams performing distinctly worse in both cases. The Cosine and Shannon 
distance metrics performed slightly better on the processed threads, and the KL-divergence 
performed marginally better for the raw threads, but not significantly. 


Stemming is a difficult task, so one explanation for the lack of improvement is simply that 
the stemming algorithms performed poorly on the text they were given. An inspection of the 
Porter and Lancaster stemmed words does reveal that many words were stemmed incorrectly. 
For example, these words were stemmed as follows, using the Porter and Lancaster stemmers, 
respectively: 


manage legal financial → Porter Stemmer → manag legal financi 


manage legal financial → Lancaster Stemmer → man leg fin 


An incorrectly stemmed word would not in itself reduce the accuracy of the results, since all instances 
of that word would be stemmed in the same way, and therefore their counts in the LDA topic models 
would be unchanged. The advantage of successful stemming is rather that ‘managing’ and ‘manage’ 
will get counted as the same word. Incorrect stemming simply means that the two forms are counted 
separately, for example as ‘manage’ and ‘manag’. Another possible explanation is that our results 
are dominated by the LDA topic model quality, and slight changes in token counts may not 
matter much if the underlying topic model is not strong. 


It is interesting that the character n-grams performed substantially worse than the other 
techniques. Because n-grams are implemented as a sliding window, many are made up of the end of 
one word and the beginning of another word, implicitly giving them more context. In contrast, 
the stemming techniques all use whitespace for token boundaries, and thus do not include 
this context. That the techniques without this context performed better might mean that words are 
not being frequently repeated in the same sequences. This could be a by-product of the fact that 
an e-mail corpus has many authors, which would swamp any effect that an individual author’s 
‘voice’ might have. It could also be a function of swiftly changing contexts and tenses, or the 
casual nature of the medium. 
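
To make the contrast concrete, the sketch below extracts overlapping character 5-grams from a short string and compares them with whitespace tokens. The helper name char_ngrams is hypothetical, and details such as case folding or punctuation handling in the actual experiments are not known and are left out.

```python
def char_ngrams(text, n=5):
    """All overlapping character n-grams of `text`, including those
    that span the space between two words."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "see you"
print(char_ngrams(phrase))  # every 5-gram here crosses the word boundary
print(phrase.split())       # whitespace tokens keep the words separate
```

Each 5-gram in this example contains the end of 'see' and the start of 'you', which is the cross-word context that whitespace-delimited tokens lack.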


Varying the number of LDA topics did have a measurable impact on the performance of the 
classification task. In all cases, there was a dramatic drop in performance for very low values of 
the LDA topics parameter. Considering the size and nature of the data set, it could legitimately 
be the case that such a low number of topics is simply a poor fit for the data, causing multiple 
‘real’ topics in the data to be conflated. Additionally, because cosine similarity measures 
angular distance, it is much less sensitive to differences in values within dimensions than across 
them. Thus in general we would expect the cosine distance measure to perform slightly worse 
at lower topic values. 


We did see that, for all the topic variation experiments for both the 100- and 1000-thread control 
groups, the divergence in performance between the different distance metrics was lower when 
the results were either very good or very bad. That is, the extreme good or bad results seemed 
to be more agnostic to the distance metric. Perhaps the differences in the categories (topics) 
are greater than the variation in the distance measures, as the quality of the model goes up (or 
down). Using the convergence of multiple distance measures might be an interesting technique 
for assessing the quality of the models in future experiments. 


As the number of threads was increased, a corresponding decrease in the accuracy of the thread 
classification was also observed. At first glance this is not surprising; more threads to choose 
from means, all else being equal, that there is a lower probability of selecting the correct thread 
for classification. However, additional threads also mean significantly more training data, and 
thus one would hope for them to increase the quality of the topic models enough to maintain or 
improve the classification results. Instead, what we see is that the noise factor increases more 
quickly than the quality of the topic models. 


If additional threads are not refining existing topics, then they may instead be adding new topics, 
suggesting that the corpus, and thus the topics of this individual’s e-mails, contain more breadth 
than depth overall. 


The optimal number of topics for the LDA model changed for the larger thread groups. This 
implies that 100 threads was not a large enough subset to be representative of the broader set of 
topics. As would be expected, the number of LDA topics increased for the larger control group, 
indicating that there were more latent topics. 


In general, there were several factors affecting the quality of the results for processed threads 
independent of the algorithm applied. Fully cleaning an e-mail corpus from scratch is a formidable 
task, and our desire to work with threads, as opposed to individual e-mails, made it infeasible 
to use the more common Enron corpus (because the required header fields were not included). 
Many Usenet and modern newsgroups have been archived and are stored online for research or 
analysis purposes, but none that we came across had been pre-processed. There was no 
immediately clear advantage to using these newsgroups over a corpus belonging to the author, while 
using the author’s corpus did provide the bonus of familiarity with the contents. However, 
in retrospect, newsgroups may have offered a richer conversational medium, with longer 
messages and more substantially fleshed out topics. 


The private nature of an individual’s e-mail corpus makes the data set difficult to 
sanitize for public release, and thus rigorous verification of research results obtained using it 
is not possible. These are strong arguments against using personal data sets. While newsgroups 
would address this because many are public, it would also be useful if the community pooled 
efforts to clean and release individual e-mail corpora. 


The major effort in cleaning such a data set involves the removal of extraneous content; in 
particular, quoted reply text, forwarded content, and signatures. Hand removal was impractical 
for the timeframe and resources at hand, and so a best-effort attempt was made to automate 
the cleaning process (see Section 3.2). Unfortunately, the result was that many replies were 
missed, reducing the quality of the data set and making it more difficult to assess all the factors 
influencing the results. In particular, since the results with raw threads were so high, it is 
possible that stray reply content may have artificially inflated some of the results for processed 
threads. 


Signatures were an equally difficult challenge. They were not removed, since there was no 
clear way to do this in an automated fashion. Although they represented a smaller proportion 
of the text than quoted replies and forwards, they are not actually content per se. On one hand, 
an e-mail thread between two authors would legitimately recognize those authors’ respective 
e-mail signatures as being indicators of membership in a thread. Further, depending on the task, 
the raw (un-processed) thread content might be the only data on hand (for example, file carving 
in digital forensics). On the other hand, from a topic modeling standpoint, it’s a bit dishonest, 
since it is not genuinely topical. Worse, for shorter threads, signature text might even dominate 
the message bodies. 


As a function of the casual nature of e-mail, there are likely to be more spelling errors in e-mail 
conversations than published documents, causing mis-spelled words to be counted as distinct 
tokens. A simple improvement would probably be to run a spell-check over the message bodies 
before other processing was conducted. 


For the raw threads, given the high frequency with which prior messages were quoted in replies,
running the classifier on the raw threads was equivalent to training on the test data. Although the
raw thread results provide an upper bound, we are more interested in seeing whether the topic models
built are robust enough to match a thread with an e-mail that has not been seen. Further, other
conversational domains to which we would like to generalize these results, such as chat, blog
comments, or phone conversations, do not in general have the luxury of quoted reply text.


For our baseline, and for creation of the training and test data set, the atomic unit of analysis 
was the message. Messages vary greatly in their lengths, and our results did not account for this
factor. 


Based on these observations, thread classification could be improved by building better data sets,
studying multiple individuals’ e-mail corpora, and better parametrizing their nature and variance.
If there is enough commonality in the topics discussed by individuals within certain
demographics, training on multiple people’s e-mail would help to refine the topic models and
improve classification capability.


5.1.1 Distance Metrics 

For the topic variation experiment with 1000 processed threads, the Shannon and Cosine measures
diverged notably from the other measures as the number of topics grew, performing consistently
better as the number of topics increased. Likewise, for the 100 processed threads, the Shannon and
Cosine measures performed better than their counterparts.


The Shannon and Cosine measures also perform better than Euclidean distance and KL divergence
for the thread variation experiment with processed threads. For the raw threads, there is
no clear dominant performer.

The 1000 thread experiment with the raw threads shows that the Shannon and KL divergence measures
outperformed the Cosine and Euclidean measures consistently for all numbers of topics, while the
results for 100 raw threads are too close for a winner to be declared.
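
For reference, the four measures compared in this section can be sketched as follows for two topic distributions. This is a minimal illustration and not the code used in the experiments. It assumes that "Shannon" refers to the Jensen-Shannon divergence and that each distribution is a list of probabilities summing to 1; the function names and the smoothing constant EPS are illustrative choices.

```python
from math import log, sqrt

EPS = 1e-12  # assumed smoothing constant, avoids log(0) and division by zero


def euclidean_distance(p, q):
    # straight-line distance between the two probability vectors
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    # 1.0 means identical direction; higher is more similar
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q):
    # asymmetric: KL(p || q) generally differs from KL(q || p)
    return sum(a * log((a + EPS) / (b + EPS)) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    # symmetric, bounded variant built from two KL terms against the mean
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Note that cosine is a similarity (larger is closer) while the other three are distances or divergences (smaller is closer), so a classifier comparing them has to rank in opposite directions.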


5.2 Topic Change 
While the keyword clustering experiment generally had poor results, many threads did have
clusters that resulted in topics with clear boundaries, and contained close to if not precisely the
correct number of topic change events.


The F-score results for the different experiments ranged from 0 to 1. Most of the threads
where all changes were accurately detected were those with a single change, and strong topics.
Still, while there were a non-negligible number of non-zero results, the more frequent result was 0,
meaning that no topic changes were accurately detected in the correct location. Here we explore
some of the factors in the success and failure of the techniques applied.
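
To make the scoring concrete, the following sketch computes an F-score for a set of detected transition positions against hand-labelled ones. It is an illustration only: the `tolerance` parameter (how many tokens away a detection may be and still count) and the greedy one-to-one matching are assumptions, not the scoring procedure used in the experiments.

```python
def transition_f_score(detected, actual, tolerance=0):
    """F-score of detected transition positions against labelled ones.

    A detection counts as correct if it lies within `tolerance` tokens of
    a labelled transition that has not already been matched.
    """
    unmatched = list(actual)
    true_positives = 0
    for d in sorted(detected):
        candidates = [a for a in unmatched if abs(a - d) <= tolerance]
        if candidates:
            # greedily consume the closest still-unmatched labelled transition
            best = min(candidates, key=lambda a: abs(a - d))
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a method that reports many spurious transitions can still reach full recall while its precision, and therefore its F-score, stays low.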


The runs with N = 25 performed better overall, although changing N from 25 to 10 had a
smaller effect on the F-scores than varying the DBSCAN inputs. In both cases, the (ε = 6,
min_pts = 4) values performed better than (ε = 10, min_pts = 6). Although ε = 10 is a more
forgiving radius, the topic data was too sparse for min_pts = 6. For these experiments, often
no clusters were even detected.
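
The interaction between the radius and min_pts can be illustrated with a small sketch. It assumes, for simplicity, one-dimensional keyword positions for a single topic; the positions are hypothetical. The neighbour test follows the convention of the implementation in Appendix A: a point is not its own neighbour, neighbours lie strictly within ε, and a core point needs at least min_pts neighbours.

```python
def core_points(positions, epsilon, min_pts):
    """Return the positions that qualify as DBSCAN core points."""
    cores = []
    for i, p in enumerate(positions):
        # count the other positions strictly within epsilon of p
        neighbours = [q for j, q in enumerate(positions)
                      if j != i and abs(q - p) < epsilon]
        if len(neighbours) >= min_pts:
            cores.append(p)
    return cores


# hypothetical keyword positions for one topic within a thread
keyword_positions = [3, 5, 8, 9, 12, 30, 47]
tight = core_points(keyword_positions, epsilon=6, min_pts=4)
loose = core_points(keyword_positions, epsilon=10, min_pts=6)
```

With these positions the tighter setting finds one core point, while the larger radius finds none because no position has six neighbours. Without a core point DBSCAN forms no cluster at all.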


Our corpus was relatively small, with a large breadth of topic coverage and short, terse e-mails 
on average. Although there were a few overarching themes across the corpus resulting in good 
models for a few topics, many of the other topics were poorly reflected in the topic models, 
with keywords not accurately represented. Correspondingly, words in the message bodies that 
perhaps should have been associated with a topic were missed, and the resulting clusters were 
of smaller size. In other cases, words had not developed a clear association with any topic, and 
topics that were seen to exist by human inspection were consistently treated as noise by the
algorithms. Certain threads never had good cluster representation.


This result could also be indicative of characteristic topic lengths in this medium.
Rather than being a descriptive prose environment, most e-mail tends to be purposeful and
functional (there are, of course, exceptions). Even a 7- or 8-message thread might only have a
couple of thousand characters, leaving little room for a topic to develop.


The (25, None, 5, 3) experiment increased the number of keywords again, removed the keyword
threshold altogether, and then shrank the DBSCAN radius. Relaxing the threshold allowed more
of the significant keywords to be included, in order to see if the topic boundaries would be more
accurate. Slightly more threads had non-zero results in this case, although those results were on
average lower than the non-zero results of the other runs. The parameters for this run appear to
have been a little too forgiving, often including words they should not have.


There were several threads which consistently had high F-scores across multiple different
parameter inputs. These topics may have been better represented in the training corpus, and
therefore while different parameters shifted the topic boundaries forward or back a few words,
or changed the density of keywords in the cluster, the clusters themselves persisted. Overall,
these more frequent topics were robust under different sets of parameter inputs, and performed
well in these experiments.


For some threads, the number of topic changes detected was correct, but the location was
incorrect because the clusters were too far apart. In these cases, the ‘median’ method used to identify
transition points (whereby the middle point between two cluster boundaries was taken) was at
fault, causing transitions to be placed far from the boundaries of either cluster. See Figure 5.1
for an example of this. These smaller clusters were often more accurate, but their boundaries
were simply too tight. The resulting space between clusters introduced ambiguity about where the
transition occurred. In general, the location and size of neighbouring clusters greatly affected
the transition locations.

[Figure 5.1 plot; axes: Topic (ordinal), Keywords; annotation: "Where should the real transition be?"]

Figure 5.1: DBSCAN clusters are too far apart, resulting in meaningless transition.
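
The ‘median’ transition rule discussed here can be sketched as follows. This is a simplified illustration, assuming each cluster is summarized by a hypothetical (start, end) pair of keyword positions. It also returns the size of the gap between neighbouring clusters, so that transitions placed in the middle of a large gap can be flagged as unreliable.

```python
def median_transitions(clusters):
    """Place a transition midway between each pair of neighbouring clusters.

    `clusters` is a list of (start, end) keyword positions, one per cluster.
    Returns a list of (transition_position, gap) pairs; a large gap means
    the midpoint is far from the boundary of either cluster.
    """
    ordered = sorted(clusters)
    transitions = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        transitions.append((prev_end + gap / 2.0, gap))
    return transitions
```

For two clusters spanning positions 0 to 10 and 50 to 60, the rule places the transition at position 30, twenty keywords from either cluster, which is the situation shown in Figure 5.1.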


In other cases, a cluster ended, and no subsequent cluster was identified, but the transition to
‘noise’ after the end of the last cluster was a meaningful transition from one topic that was
rather well defined, to one that simply was not. This could be remedied by more suitable
clustering parameters, more nuanced transition point identification, or both (see
Figure 5.2).


Certain transitions were incorrectly inserted between clusters of the same topic. In these cases
the division between clusters was an artifact of the parameters for that particular run of the
experiment, and did not represent the presence of intermediary topics. See again Figure 5.2.

[Figure 5.2 plot; axes: Topic (ordinal), Keywords; annotation: "Meaningful transition not captured"]

Figure 5.2: The end of the last cluster is a meaningful transition, but is not captured because there is no subsequent
cluster. The transition identified between clusters of the same topic is an artifact of the experiment parameters, and
not a real transition.


The transition mechanism also did not perform well with overlapping clusters. In the present 
implementation, multiple clusters overlapping would cause the endpoints of those clusters to be 
averaged, when calculating the transition point between the end of one set of clusters and the 
beginning of another. This made the calculation simple, but there was no particular structural 
reason for these clusters to be handled that way. A more subtle treatment of transition point 
calculation would aid in the identification of accurate transitions. 

Beyond the mechanics of topic change identification, there are subtleties to its definition as well.
A transition is almost never discrete; it is more accurately modelled by the decreasing
influence of one topic, and the corresponding increasing influence of another. An intuitive
decision for the transition point in this case would be the point at which one topic overtakes
another in terms of weight. But clearer steps should be taken to make logical decisions when
there is a gap between clusters, or when the topic transition is gradual.
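
The "overtaking" rule suggested above can be sketched in a few lines. This is a hypothetical illustration rather than something implemented in this thesis; it assumes that the weight of each topic has already been computed at every position (for example, per window), and it ignores noise that might cause the two series to cross more than once.

```python
def crossover_point(weights_a, weights_b):
    """Index of the first position where topic B outweighs topic A.

    `weights_a` and `weights_b` are equal-length sequences of per-position
    weights for the outgoing and incoming topic. Returns None if topic B
    never overtakes topic A.
    """
    for i, (a, b) in enumerate(zip(weights_a, weights_b)):
        if b > a:
            return i
    return None
```

A practical version would probably smooth both series first, so that a single stray keyword does not trigger a transition.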


For both experiments, the transition point selection mechanism involved taking the median
point either of the sliding window, or the boundaries of two clusters. For the sliding window
approach, this clearly does not scale as the window size grows. In fact,
this is a shortfall of the method overall, since by taking the total weight over a single window,
the growing influence of secondary topics is smoothed over, as is the order of the words (for
example, perhaps the winning topic is actually strong at the beginning of the window but tapers
off at the end). A way to address this might be to keep track of multiple topics in each window,
supporting the conceptual approach that documents are mixtures of topics.
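
Keeping track of multiple topics per window could look like the sketch below. It is an assumption-laden illustration, not the implementation used in the experiments: `word_topic_weight` is a hypothetical mapping from each word to a dictionary of topic weights, and the window is assumed to advance one token at a time.

```python
from collections import defaultdict


def window_topic_weights(tokens, word_topic_weight, window_size, top_k=2):
    """For each window, return the top_k (topic, total_weight) pairs."""
    results = []
    last_start = max(len(tokens) - window_size, 0)
    for start in range(last_start + 1):
        totals = defaultdict(float)
        for token in tokens[start:start + window_size]:
            # words absent from the model contribute nothing
            for topic, weight in word_topic_weight.get(token, {}).items():
                totals[topic] += weight
        ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
        results.append(ranked[:top_k])
    return results
```

Comparing the first and second entries of consecutive windows shows a secondary topic gaining weight before it becomes the winner, which a single-winner window hides.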


This thesis treated transition points between topics as discrete points in time. A more nuanced
analysis of how topics change, alternative ways to measure that change, and ways to represent slow
versus fast changes, or discrete versus evolutionary changes, would enrich the discussion.


Overall, the sliding window technique performed measurably worse, in particular in the number
of changes detected (as opposed to their location). For many of the threads, this method
vastly overestimated the number of transitions (see Figure 5.3 C), causing overall high recall.
The result in these cases was that the likelihood of one of those transitions being in the correct
location increased, but the positive results in those cases were actually a side effect of the
high recall.


Unfortunately, increasing the window size from 20 to 40 and then to 100 did more to negate
previously correct transitions detected by the smaller window sizes, than to improve the vast
overfitting observed in some threads.


Between the keyword clustering and sliding window approaches, the sliding window approach 
was simpler to implement, but too sensitive to change. The keyword clustering approach was 
more challenging, with more parameters to understand and refine, but it was also better at 
detecting substance over noise. The DBSCAN method did not account for word-topic weights 
beyond their presence in the top N keywords. There was a question of whether this would gloss 


over important nuances, but it did not negatively impact the results. In fact, because the top
words for a topic tend to have extremely high weights (again, see Section 4.4) it raises the
question of whether the sliding window approach is too influenced by the relative weights of
words, especially as they get large. Future experiments might try modifying the sliding window
approach by capping or smoothing the word-topic weights.

[Figure 5.3 panels: A) Topic Transition Points for Thread 53520713; B) Topic Transition Points for Thread 53773016;
C) Topic Transition Points for Thread 75215072 (Sliding Window Method, window size = 20)]

Figure 5.3: Three outputs from the sliding window experiment. Window sizes as described in respective titles. A)
Correct results; B) Erroneously detects two topics, close to the actual topic change; C) Extreme overfitting, too many
topics detected. (LDA Topics = 50). Personally identifying terms are grayed out for privacy purposes.
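
Two simple ways of capping or smoothing the word-topic weights are sketched below. Neither was evaluated in this thesis; the cap value and the choice of a logarithmic transform are assumptions made for illustration only.

```python
from math import log


def cap_weight(weight, cap):
    """Clip a word-topic weight so no single word exceeds `cap`."""
    return min(weight, cap)


def log_smooth(weight):
    """Compress large weights while keeping their ordering (log(1 + w))."""
    return log(1.0 + weight)


def smooth_model(word_topic_weight, transform):
    """Apply `transform` to every weight in a word -> {topic: weight} mapping."""
    return {word: {topic: transform(w) for topic, w in topics.items()}
            for word, topics in word_topic_weight.items()}
```

Either transform could be applied to the model once, before the sliding window totals are computed, so that a single heavily weighted keyword cannot decide the winning topic for a whole window on its own.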


Topic change is a more challenging area than e-mail-thread classification. The classification
task had the advantage of working with all information in a thread, whereas the topic detection
was working with shorter sequences of words; sometimes only a couple of sentences. This
is not much content for the algorithm to accurately detect a topic, especially if the words used
were not already heavily weighted towards a particular topic. Oftentimes, the topic might be missed
altogether.

To a certain extent, using e-mail complicated many of the questions about how to detect topic
change. Most messages are short, and topic changes, when they do occur, can be very subtle.
Between the data cleaning and pre-processing issues discussed in the previous section, the
multiple possible paths through each thread, and the low frequency with which clear topic change
occurs in e-mail, it was difficult to get good quality data, in enough quantity, to draw statistical
conclusions about the result.


In summary, detecting the presence of topic change in a conversational setting still requires
work, as evidenced by the results and discussions above. Many of the observations made here
could likely improve the techniques, but first and foremost the current techniques should be
applied to a larger corpus, and run with a significantly expanded range of parameter inputs.
Additionally, a more formal analysis of how to optimize the parameters used in the keyword
clustering experiment (N, the keyword threshold, ε and min_pts) is necessary.


5.2.1 Open Questions 

Through this experiment, many important questions surfaced about the nature of topic change 
and how it should be measured. Every topic is arguably part of a hierarchy, and every topic has 
sub-elements. As a result, a critical factor in topic change analysis is the coarseness of analysis 
applied, and defining transitions that are formalizable and repeatable. 


LDA is a tool we use to statistically represent topics, and to some degree the number of topics
selected as an input parameter to the algorithm will determine this coarseness. At the same time,
it should be remembered that the ‘topics’ discovered by LDA might not be the topics a human
reader would pick out.


For example, an entire e-mail thread might be about an academic course being taken, but the
first few e-mails are about meeting for a study session, and the last few about the latest
assignment. Similarly, another thread might be about a vendor’s plans to film an event, but the
beginning might be about getting the vendor event passes, and the second half might be a
detailed discussion about the cabling they need. Less explicit changes need more training.


Additionally, there are many topics which appear once in the corpus and then never again,
and thus as mentioned above, the weighting of those words for the respective topics is poor.
It is easy to say that more and better data would help in our ability to detect more subtle
transitions, and that is true. But it would also be very interesting to take multiple conversational
corpora and train over the whole set, in an attempt to improve the quality of the models. If there
genuinely is overlap in conversational topics across individuals, this will be helpful. And where
there is not, such a technique might instead be useful in identifying personal versus popular
topics.


Proximity between words of the same topic could be used to scale the weight of each word in
either technique, decreasing the likelihood of stray words being picked up. In fact, the inherent
time dimension of conversational texts suggests that a Markov approach, giving greater weight
to the current or previously seen topics, could help to improve the results. Topics could be
clustered together, showing where transitions among certain sets of topics are more likely than
others. The likelihoods would ultimately be a function of generic as well as personal categories.


Part-of-speech tagging could be used to give grammatical context where token context is
insufficient. In particular, conversational contexts contain many question/answer type interactions,
where even the probabilistic models applied here seem to fail. Modeling these interactive
dynamics and using them as, or to inform, the priors for topics might yield improved results.


5.3 Stopwords and Keywords

Additional work on averaging and entropy-based methods (such as [21]) could determine and 
remove stop words on the fly. This would reflect the fact that stop words are often a function of 
context, and would also support the development of language independent solutions. 


Consider Figure 5.4, which shows a full graph of LDA word-topic weights, where stop words
have not been removed. Although the stop words are co-appearing with their more substantive
neighbours, they are also co-appearing frequently with other stop words. In the resulting LDA
topic models, these stop words are weighted most heavily towards topics which appear to be
dominated by stop words. This could be used as a way to programmatically identify stop words.
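
One way this idea might be turned into a procedure is sketched below. It is a hypothetical illustration, not something evaluated in this thesis. It assumes a small seed list of known stop words, a mapping `topic_top_words` from each topic to its top keywords, and a mapping `word_topic_weight` from each word to its per-topic weights; all names and the 0.5 threshold are illustrative.

```python
def stopword_topics(topic_top_words, seed_stopwords, min_fraction=0.5):
    """Topics whose top words are mostly known stop words."""
    flagged = set()
    for topic, words in topic_top_words.items():
        if not words:
            continue
        fraction = sum(1 for w in words if w in seed_stopwords) / len(words)
        if fraction >= min_fraction:
            flagged.add(topic)
    return flagged


def candidate_stopwords(word_topic_weight, flagged_topics, seed_stopwords):
    """Words outside the seed list whose heaviest topic is a flagged topic."""
    candidates = set()
    for word, weights in word_topic_weight.items():
        if word in seed_stopwords or not weights:
            continue
        heaviest = max(weights, key=weights.get)
        if heaviest in flagged_topics:
            candidates.add(word)
    return candidates
```

The candidates returned this way would still need human review, since a substantive word can also end up in a stop-word-dominated topic by chance.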

Figure 5.4: In this graph of a subsection of a thread, we can see that stop words consistently appear in two topics
above and beyond all others. Personally identifying names and locations have been blurred.

5.4 Signature Detection

Signature detection and removal is a laborious and time-consuming process if done by hand,
and error-prone if done programmatically. In the course of examining patterns present in e-mail
threads, small repeated blocks of text, in particular, signatures, were seen to be very distinct
when viewed visually. In addition, because signatures of frequent correspondents appear
multiple times, in exactly the same word sequences, these signatures are clustered together in topics
with extremely high frequency (see Figure 5.5).

Figure 5.5: A graph of topic word weight for one e-mail. Again the words along the x-axis are in order of appearance,
so this dimension also represents time.


This raises the question of whether LDA might be useful in automatically identifying and
removing signatures. This would only be useful for authors with whom correspondence was
moderately frequent. However, others have shown [18] that the relationship between
correspondents and frequency exhibits an exponential falloff (see Figure 5.6), suggesting that such
a method, if it succeeded, would still be useful for the majority of e-mails in an individual’s corpus.

[Figure 5.6 plot; axes: Number of Messages, Relationships]

Figure 5.6: Power distribution of relationships in e-mail. Figure from [18, p.8]

5.5 Other Future Work


Many times in this document the nature of conversational text compared to more traditional
documents has been mentioned. E-mail has more stop words, more implicit references, and is more
casual. It is possible these same characteristics which make e-mail challenging to analyze would
also allow conversational text to be automatically identified in a stream of non-conversational
text. For example, could the point in a webpage where a blog post ends and the comments
begin be automatically identified? Or conversational content in noisy packet streams? Similarly,
while conducting forensic file system analysis or file carving, such a technique could identify
chat logs or e-mails not otherwise known to exist.


While exploring the optimal number of input topics to the LDA algorithm, a question that
arises is whether there might be a characteristic number of topics which map onto the average
individual’s day-to-day communications. Do most people talk about a certain number of things?
What can be said of those who have a larger or smaller breadth of topics they discuss? Might
conditions such as autism or attention deficit disorder be detected via long term studies of
topics in individuals’ personal communications?

CHAPTER 6:
Conclusions 





We have demonstrated that state-of-the-art probabilistic topic models can be successfully
applied to classifying emails with their original threads. These documents are more casual and
contextual, and generally shorter and more terse, than their more formal counterparts. For up
to 1000 threads, we show that raw email messages can be correctly classified with up to 95%
accuracy.


For more generic conversational threads, without the characteristic quoted and reply text of 
emails, we present results significantly better than baseline, and identify several ways that the 
results could be improved upon. 


Further exploring the nature of conversation corpora, two new techniques for identifying topic
change are developed and tested on a small set of cleaned email threads. The first is a keyword
clustering method, and the second is a sliding window technique. The results show promise,
and provide a concrete baseline on which future work can improve. We describe numerous
ways in which these methods could be refined.

APPENDIX A:
Code 





A.1 DBSCAN


The DBSCAN Python implementation used for this thesis is included below. 


#!/usr/bin/env python3

################################################################
# Jessy Cowan-Sharp, August 2009
#
# References:
#
# 1. "A density-based algorithm for discovering clusters in large
#    spatial databases with noise," Ester, M. and Kriegel, H.P. and
#    Sander, J. and Xu, X.
################################################################

'''
very code-like pseudo-code
DBSCAN
for point in points:
    if point is visited:
        continue
    mark point as visited
    neighbours = immediate_neighbours(point, epsilon)
    if len(neighbours) >= min_pts:
        cluster = new_cluster()
        append point to cluster
        for n in neighbours:
            cluster.append(all_neighbours(n))
    else:
        mark point as NOISE

def all_neighbours(n, epsilon, cluster):
    for point in points:
        if point has not been visited:
            mark point as visited
            new_points = immediate_neighbours(point)
            if len(new_points) >= min_pts:
                points.append(new_points)
        if point is not member of any cluster:
            append point to cluster
'''

from math import sqrt


class Point(object):
    ''' internal helper class to support algorithm implementation '''
    def __init__(self, feature_vector):
        # feature vector should be something like a list or a numpy
        # array
        self.feature_vector = feature_vector
        self.cluster = None
        self.visited = False

    def __str__(self):
        return str(self.feature_vector)


def _as_points(points):
    ''' convert a list of list- or array-type objects to internal
    Point class '''
    return [Point(point) for point in points]


def as_lists(clusters):
    ''' converts the Points in each cluster back into regular feature
    vectors (lists). '''
    clusters_as_lists = {}
    for cluster, members in clusters.items():
        clusters_as_lists[cluster] = [member.feature_vector for member in members]
    return clusters_as_lists


def print_points(points):
    ''' klugey function for printing lists of points. '''
    return '\n'.join(str(p) for p in points)


def euclidean(x, y):
    ''' calculate the euclidean distance between x and y. '''
    # sqrt((x0-y0)^2 + ... + (xN-yN)^2)
    assert len(x) == len(y)
    total = 0.0
    for i in range(len(x)):
        total += (x[i] - y[i]) ** 2
    return sqrt(total)


def immediate_neighbours(point, all_points, epsilon, distance, debug):
    ''' find the immediate neighbours of point. '''
    # NOTE: there is probably a better way to do this.
    neighbours = []
    for p in all_points:
        if p is point:
            # you can't be your own neighbour...!
            continue
        d = distance(point.feature_vector, p.feature_vector)
        if d < epsilon:
            neighbours.append(p)
    return neighbours


def add_connected(points, all_points, epsilon, min_pts, current_cluster, distance, debug):
    ''' find every point in the set of all_points which are
    density-connected, starting with the initial points list. '''
    cluster_points = []
    for point in points:
        if not point.visited:
            point.visited = True
            new_points = immediate_neighbours(point, all_points, epsilon, distance, debug)
            if len(new_points) >= min_pts:
                # append any new points on the end of the list we're
                # already iterating over.
                for p in new_points:
                    if p not in points:
                        points.append(p)

        # here, we separate 'visited' from cluster membership, since
        # 'visited' only helps keep track of if we've checked this
        # point for neighbours. it may or may not have been assessed
        # for cluster membership at that point.
        # compare against None explicitly, since cluster 0 is a valid id.
        if point.cluster is None:
            cluster_points.append(point)
            point.cluster = current_cluster
    if debug:
        print('Added points %s' % print_points(cluster_points))
    return cluster_points


def dbscan(points, epsilon, min_pts, distance=euclidean, debug=False):
    ''' Main dbscan algorithm function. pass in a list of feature
    vectors (most likely a list of lists or a list of arrays), a
    radius epsilon within which to search for neighbouring points, and
    a min_pts, the minimum number of neighbours a point must have
    within the radius epsilon to be considered connected. the default
    distance metric is euclidean, but another could be used as
    well. your custom distance metric must accept two equal-length
    feature vectors as input and return a distance value. pass in
    debug=True for verbose output. '''
    assert isinstance(points, list)
    epsilon = float(epsilon)
    if not isinstance(points[0], Point):
        # only check the first list instance. imperfect, but the lists
        # could be arbitrarily long.
        points = _as_points(points)

    if debug:
        print('\nEpsilon: %.2f' % epsilon)
        print('Min_Pts: %d' % min_pts)

    clusters = {}      # each cluster is a list of points
    clusters[-1] = []  # store all the points deemed noise here.
    current_cluster = -1

    for point in points:
        if not point.visited:
            point.visited = True
            neighbours = immediate_neighbours(point, points, epsilon, distance, debug)
            if len(neighbours) >= min_pts:
                current_cluster += 1
                if debug:
                    print('\nCreating new cluster %d' % current_cluster)
                    print('%s' % str(point))
                point.cluster = current_cluster
                cluster = [point]
                cluster.extend(add_connected(neighbours, points, epsilon, min_pts,
                                             current_cluster, distance, debug))
                clusters[current_cluster] = cluster
            else:
                clusters[-1].append(point)
                if debug:
                    print('\nPoint %s has no density-connected neighbours.'
                          % str(point.feature_vector))

    # a point first marked as noise may later have been absorbed into a
    # cluster as a border point; keep only the true noise points here.
    clusters[-1] = [p for p in clusters[-1] if p.cluster is None]

    # return the dictionary of clusters, converting the Point objects
    # in the clusters back to regular lists
    print('length of original list: %d' % len(points))
    returned = 0
    for members in clusters.values():
        returned += len(members)
    print('length of returned points: %d' % returned)
    return as_lists(clusters)


if __name__ == '__main__':
    import random

    epsilon = 2.0
    min_pts = 2
    points = []
    points.append([1, 1])
    points.append([1.5, 1])
    points.append([1.8, 1.5])
    points.append([2.1, 1])
    points.append([3.1, 2])
    points.append([4.1, 2])
    points.append([5.1, 2])
    points.append([10, 10])
    points.append([11, 10.5])
    points.append([9.5, 11])
    points.append([9.9, 11.4])
    points.append([15.0, 17.0])
    points.append([15.0, 17.0])
    points.append([7.5, -5.0])

    clusters = dbscan(points, epsilon, min_pts, debug=True)
    print('\n========== Results of Clustering ==========')
    for cluster, members in clusters.items():
        print('\n-------- Cluster %d--------- ' % cluster)
        for point in members:
            print(point)

    points = []
    for i in range(100):
        points.append([random.uniform(0.0, 20.0), random.uniform(0.0, 20.0)])

    clusters = dbscan(points, epsilon, min_pts, debug=True)
    print('\n========== Results of Clustering ==========')
    for cluster, members in clusters.items():
        print('\n-------- Cluster %d--------- ' % cluster)
        for point in members:
            print(point)

APPENDIX B:
Stop Words 





B.1 Generic Stop Words 


The generic English language stop word list was taken from the Natural Language Toolkit
(NLTK) [1]. It comprises the following words:


a 
a’s 

able 

about 
above 
according 
accordingly 
across 
actually 
after 
afterwards 
again 
against 
ain’t 

all 

allow 
allows 
almost 
alone 
along 
already 
also 
although 
always 

am 


among 


amongst 
an 

and 
another 
any 
anybody 
anyhow 
anyone 
anything 
anyway 
anyways 
anywhere 
apart 
appear 
appreciate 
appropriate 
are 

aren’t 
around 

as 

aside 

ask 
asking 
associated 
at 
available 


away 
awfully 

b 

be 
became 
because 
become 
becomes 
becoming 
been 
before 
beforehand 
behind 
being 
believe 
below 
beside 
besides 
best 
better 
between 
beyond 
both 
brief 

but 

by 
can
can’t 
cannot 

cant 

cause 
causes 
certain 
certainly 
changes 
clearly 

co 

com 

come 
comes 
concerning 
consequently 
consider 
considering 
contain 
containing 
contains 


corresponding 


could 
couldn’t 
course 
currently 
d 
definitely 
described 
despite 
did 
didn’t 
different 
do 

does 
doesn’t 
doing 
don’t 
done 
down 
downwards 
during 

e 

each 

edu 

eg 

eight 
either 


else 
elsewhere 
enough 
entirely 
especially 
et 

etc 

even 

ever 

every 
everybody 
everyone 
everything 
everywhere 
ex 

exactly 
example 


except 


few 

fifth 

first 

five 
followed 
following 
follows 
for 
former 
formerly 
forth 
four 
from 
further 


furthermore 
g

get 

gets 
getting 
given 
gives 

go 

goes 
going 
gone 
got 
gotten 
greetings 
h 

had 
hadn’t 
happens 
hardly 
has 
hasn’t 
have 
haven’t 
having 
he 

he’s 
hello 
help 
hence 
her 

here 
here’s 
hereafter 


hereby 


herein 
hereupon 
hers 
herself 
hi 

him 
himself 
his 

hither 
hopefully 
how 
howbeit 


however 


if 
ignored 
immediate 
in 
inasmuch 
inc 
indeed 
indicate 
indicated 
indicates 
inner 
insofar 
instead 
into 


inward 
is
isn’t
it
it’d
it’ll
it’s
its
itself
j
just
k
keep
keeps
kept
know
knows
known
l

last 
lately 
later 
latter 
latterly 
least 
less 
lest 
let 
let’s 
like 
liked 
likely 
little 
look 
looking 


mainly 
many 

may 
maybe 

me 

mean 
meanwhile 
merely 
might 
more 
moreover 
most 
mostly 
much 
must 

my 

myself 

n 

name 
namely 

nd 

near 
nearly 
necessary 
need 

needs 
neither 
never 
nevertheless 
new 


next 


nine 

no 
nobody 
non 
none 
noone 
nor 
normally 
not 
nothing 
novel 
now 
nowhere 
o
obviously 
of 

off 
often 

oh 

ok 

okay 

old 

on 

once 
one 
ones 
only 
onto 

or 

other 
others 
otherwise 
ought 


our 


ours 
ourselves 
out 

outside 
over 
overall 
own 

p

particular 
particularly 
per 

perhaps 
placed 
please 

plus 
possible 
presumably 
probably 


provides 


rather 

rd 

re 

really 
reasonably 
regarding 
regardless 
regards 
relatively 


respectively 


say 
saying 
says 
second 
secondly 
see 
seeing 
seem 
seemed 
seeming 
seems 
seen 

self 
selves 
sensible 
sent 
serious 
seriously 
seven 
several 
shall 

she 
should 
shouldn’t 
since 

six
so

some 


somebody 
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specified 
specify 
specifying 
still 

sub 

such 


t’s 
take 
taken 
tell 
tends 
th 
than 
thank 
thanks 
thanx 
that 
that’s 
thats 
the 
their 


theirs 


them 
themselves 
then 
thence 
there 
there’s 
thereafter 
thereby 
therefore 
therein 
theres 
thereupon 
these 

they 
they’d 
they’ll
they’re 
they’ve 
think 
third 

this 
thorough 
thoroughly 
those 
though 
three 
through 
throughout 
thru 

thus 

to 
together 
too 

took 


toward 
towards 
tried 
tries 
truly 

try 
trying 
twice 
two 

u 

un 
under 
unfortunately 
unless 
unlikely 
until 
unto 

up 

upon 
us 


use 


used 
useful 
uses 
using 
usually 
uucp 

v

value 
various 
very 
via 


viz


we'll 
we're 
we've 
welcome 
well 
went 
were 
weren't 
what 
what’s 
whatever 
when 
whence 
whenever 
where 
where’s 
whereafter 
whereas 
whereby 
wherein 


whereupon 


wherever
whether
which
while
whither
who
who’s
whoever
whole
whom
whose
why
will
willing
wish
with
within
without
won’t
wonder
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves
z
zero


B.2 Custom Stop Words


These stop words were identified by human inspection of the corpus.


jessy
cowan
sharp
cowansharp
cowan-sharp
202
360
3967
http
www
org
com
net
arc
nasa
gov
mail
email
gmail
google
googlegroups
groups
subscribe
unsubscribe
1
2
3
13
14
15
16
17
18
19
20
80
30
2006
2007
2008
2009
html
650
815
450
831
656
mailing
mailman
listinfo
lists
cgi
bin
604
br
ll
ve
div
ames
center